When I first began collecting SNOBOL4 programs for a book, I 
had two major misgivings. First, I wondered whether there 
would be enough material and second, I wondered whether the 
programs would be sufficiently nonobvious to warrant publica- 
tion. Both fears slowly evaporated. On the one hand, the 
range of SNOBOL4 applications is as wide as the spectrum of 
computer uses and this, it seems, is well-nigh inexhaustible. 
Indeed, an entire book of algorithms and algorithmic  techni- 
ques has recently appeared [Aho et al, 1974] in which the 
range of applications and techniques when intersected with 
that of my own book approximates the empty set. It gives one 
pause to contemplate the complement of both sets. In the end, 
I had a considerable amount of material left over and so my 
one fear was baseless. 


As to my other concern, I was happy to discover in the course 
of writing the book many new and nonobvious ways of program- 
ming in SNOBOL4 (not all of my own discovery) so that I can 
now be confident that the collection of routines is more than 
merely exercises in the use of the language. Indeed, some 
routines or techniques were previously believed to be im- 
possible to write in SNOBOL4. For example, employing SNOBOL4 
patterns directly in the compilation process, dynamically 
loading SNOBOL4 functions on a call basis, and determining the 
compilation numbers of statements compiled at execution time 
are three problems encountered during the development of 
production programs which were previously thought simply not 
doable in the language. These are relatively easily achievable 
by techniques described in this book (see Programs L ONE 
(18.2), DEXTERN (14.2) and LPROG (11.5) respectively). Since 
I have been a SNOBOL programmer for over a decade and since I 
am still discovering how to do things in the language, the 
reader may conclude either that I am a dunce or that the 
designers of SNOBOL4 have created a very flexible and powerful 
language that deserves further study and wider use. The 
remainder of the book will convince him, I hope, that it is 
the latter and not the former. 


Another, less prominent, concern was the relative obscurity of 
the SNOBOL4 language. While more widely used and available 
than most languages, it is not so ubiquitous as, say, Fortran or 
Cobol. For a variety of reasons, such as cheaper machines, it 
is not hard to visualize a future in which SNOBOL4, or at 
least a SNOBOL4-like approach to life, will play a more promi- 
nent role. Also the quest for simplicity of programming may 
ultimately be achieved by way of semantic richness rather than 
by feature elimination. 


Viewed most generally, the book is a collection of algorithms 
with SNOBOL4 used as a communication vehicle. The algorithms 
are decidedly oriented toward the nonnumerical as this is 
SNOBOL4's forte and as such tend to supplement other published 
algorithms such as those appearing in the Communications of 
the ACM which, due to the reliance on Fortran and Algol, are 
primarily mathematical in nature. Because of its nonnumerical 
character, the book should be especially helpful to artisans 
in the humanities and in business applications as well as to 
the information scientists to whom the work is primarily ad- 
dressed. The reader is assumed to know or be learning SNOBOL4 
and if his knowledge in this respect is a little weak he 
should be willing to consult an appropriate manual or primer 
for reference. Little or no assumption is made with respect 
to his knowledge of other areas of computer science and 
mathematics. 


As a collection of SNOBOL4 algorithms, the book lends itself 
for direct use by the growing number of SNOBOL4 programmers 
who may use the programs as is, or modify them to suit their 
particular application. To further this end, virtually all 
programs are written as functions with a conscientiously ap- 
plied naming system so that they can be simply 'plugged in' to 
existing programs without disturbing things. Hence another 
purpose is served, i.e., to foster and illustrate a technique 
of well-structured modular programming which is all too fre- 
quently lacking in many SNOBOL4 programs. There is currently 
great interest and for good reason in goto-less structured 
programs and while the control structures of SNOBOL4 prohibit 
adherence to the letter of this dictum, the examples in this 
book serve to carry out its spirit. 


The SNOBOL4 programmer will find much information of an im- 
plementation nature not available elsewhere. Most of this is 
intended to guide him in the writing of more efficient 
programs but some SNOBOL4 lore is included for his general in- 
formation. An effort has been made to describe pattern mat- 
ching more fully and comprehensively than it has been 
heretofore as this has been one of the murkier aspects of the 
language. 


Finally, the large number of SNOBOL4 example programs should 
complement well a SNOBOL4 primer or manual in teaching the 
language. This author's experience has been that programming 
languages as well as natural languages are most easily taught 
by varied and intriguing examples. Not only is interest 
heightened and motivation increased, but the example carries 
the student forward on a familiar framework and provides a 
convenient gestalt for later recall. Because of this use as a 
supplementary text, various features of the language are com- 
partmentalized in the early chapters so that their introduc- 
tion can be synchronized with a course of instruction. In fact 
the author has used notes from this book very successfully in 
teaching a course in nonnumerical programming to members of 
the staff at Bell Laboratories and to graduate students at 
Stevens Institute of Technology. A number of exercises have 
been included to extend its usefulness in the classroom as 
well as to suggest possible modifications of the routines 
themselves. 


The alert reader will note that the book was prepared by a 
computer. This was done to permit the automatic testing of 
the programs. To remain faithful to this idea, all figures, 
titling, paragraph illumination, etc. were done without suc- 
cumbing to the temptation of later touchup. Chapter 10 
describes in detail some of the routines used in the book's 
production. 


The programs, as presented, are directly applicable to the IBM 
360 implementation of SNOBOL4 and SPITBOL. In virtually all 
cases, these programs can be used with SNOBOL4 processors 
(including SITBOL) on other machines without change or, at 
most, by a transliteration of characters. 


The writing style has been chosen to be direct, informal and 
sometimes even cheerful. It is hoped that occasional lapses 
into whimsy (not expunged by the final version) do not disturb 
the reader; the intent is not so much to amuse as to present 
a welcome relief to the frankly difficult task of reading and 
interpreting programs. 


A number of individuals have contributed in one way or another 
to the production of this book. Thanks go to Frank Boesch, 
Len Bosack, Fran Brophy, Steve Chen, Bob Dewar, Ralph 
Griswold, Scott Guthrey, Dave Hanson, Cass Lewart, J. C. Noll, 
Ivan Polonsky, Mark Rochkind, Larry Samberg, Dick Stone, and 
Jane Walsh. A special appreciation goes to Ralph Griswold who 
taught a Computer Science course at the University of Arizona 
from an early computerized draft of Chapters 2-5 and provided 
valuable feedback. I am flattered that he was able to expand 
on this material to produce an excellent and very readable 
book [Griswold 1974a]. Those having difficulty reading the 
early chapters here may wish to consult this text. 


Finally, thanks go to the management and staff of Bell 
Laboratories whose consent, cooperation and computers have 
made this text possible. 


James F. Gimpel 
Holmdel, New Jersey 
May 1, 1975 


Algorithms and Programs


An algorithm is a sequence of self-evident steps for carrying 
out some activity. A familiar example of an algorithm is the 
procedure for 'long' multiplication, which multiplies two 
numbers that are bigger than the operands in a memorized 
table. The notion of algorithm is actually quite old, going 
back several thousand years B.C. [Knuth 1972], and the word 
'algorithm' has a long and convoluted etymology [Knuth Vol. 1, 
p. 1-2]. 
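
As an illustration only, the 'long' multiplication procedure 
can be sketched in Python (not the language of this book). The 
names TABLE and long_multiply are hypothetical; the only 
multiplication the sketch relies on is a lookup in a memorized 
single-digit table, with the rest done by adding and carrying. 

```python
# Hypothetical sketch: long multiplication that consults only a
# memorized table of single-digit products, plus carrying.
TABLE = {(a, b): a * b for a in range(10) for b in range(10)}


def long_multiply(x: str, y: str) -> str:
    """Multiply two non-negative decimal numerals given as digit strings."""
    xs = [int(c) for c in reversed(x)]   # least significant digit first
    ys = [int(c) for c in reversed(y)]
    result = [0] * (len(xs) + len(ys))   # enough room for every digit
    for i, a in enumerate(xs):
        carry = 0
        for j, b in enumerate(ys):
            total = result[i + j] + TABLE[(a, b)] + carry
            result[i + j] = total % 10
            carry = total // 10
        result[i + len(ys)] += carry     # final carry of this row
    digits = "".join(str(d) for d in reversed(result)).lstrip("0")
    return digits or "0"


print(long_multiply("123", "456"))
```

Each step here is 'self-evident' in the sense of the text: a 
table lookup, an addition, or the splitting of a two-digit 
total into a digit and a carry. 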


We say an algorithm is composed of "self-evident steps" to 
rule out some such phrases as "add salt to taste", or "apply 
sward to mainskee according to Fig. 3". That is, each step 
can be mechanically carried out without assistance from a 
human being. But it is interesting to note that the definition 
of algorithm is not a rigorous one, since no one can ever give 
an all-inclusive definition of "self-evident step". What we 
generally do is devise a special language within which each 
operation is carefully defined, and this language is used to 
express all algorithms. Thus we can devise a special machine 
language as was done by Knuth [Vol. 1-3], or we may devise a 
matching and replacement operation as was done by Markov 
[1954], or invent a dialect of some existing language, such as 
Pidgin ALGOL [Aho et al, 1974], or we may use an existing 
programming language, such as is used in the Algorithms sec- 
tion of the Communications of the ACM. In this book we will 
use an existing language, viz. SNOBOL4 [Griswold et al, 1971]. 


This means that our collection of techniques are not merely 
algorithms, they are programs as well. Since there is some 
question (not to mention controversy) as to the distinction 
between algorithm and program [ACM Algorithm Letters, 1966 and 
ACM Forum, 1974-1975], it is perhaps worth our trouble to 
consider these two notions. An algorithm is a method, distinct 
from any external form, and distinct from any language. On 
the other hand, a program is a sequence of characters which 
will implement some process. For example, we may say that a 
program is 332 characters long, but we may not say such a 
thing about an algorithm, because an algorithm may be im- 
plemented in several different languages producing programs of 
various lengths. To communicate the algorithm to another human 
being, we generally require its formulation in terms of 
concrete symbols. Any such formulation may be said to be a 
program. Hence, on the surface at least, the notions of al- 
gorithm and program would seem to bear the same relationship 
to each other as the notions of function and expression in 
mathematics. That is, one is a representation of the other. 
However, the analogy is somewhat imperfect. Programs are 
generally written to be run on a digital computer, and, as 
such, tend to communicate an algorithm to a machine, as op- 
posed to another human being. Programs are a medium whereby a 
process is effected, and hence are, as it were, part of the 
machinery. We may therefore expect them to reflect 
idiosyncrasies not part of the original pure algorithmic no- 
tion. That is, programs may be dirty. On the other hand, 
programs, when coupled with an appropriate linguistic proces- 
sor, can actually carry out the activity for which they are 
designed. In short, they work. 


Although in principle an algorithm is independent of the par- 
ticular language in which it is expressed, in practice, this 
is an impossibility. This is because, as the notion of self- 
evident step varies, the techniques employed to carry out an 
overall activity will vary. Thus, a method to compute a hash 
function will depend on what arithmetic operations (such as 
division) are available. Random number generators will depend 
not only on what operations are present, but on whether some 
forms of arithmetic overflow are permitted. Certainly, string 
algorithms implemented in a Markov language such as SNOBOL4, 
which permits string scanning as a fundamental operation, will 
appear entirely different than when written in some other 
language. This is unavoidable and is, of course, one of the 
purposes of a text like this one. 


There is currently heightened interest in both algorithms and 
in programs. For example, there is a famous problem in graph 
theory called the Koenigsberg Bridge Problem. The problem 
calls for a path leading across all edges (bridges) of a graph 
without traveling along any edge twice. A constructive 
procedure for finding such a path was furnished by Euler in 
1736; this has long been regarded as the starting point of 
modern graph theory. However, it was not until 1973 [Edmonds 
and Johnson] that anyone specified a method for finding such a 
path in an amount of time proportional to the number of edges. 
This particular example is only typical of a general trend. 
We are no longer content with knowing that a procedure can be 
carried out, nor even with how such a procedure can be carried 
out. The thrust of much computer science activity is in deter- 
mining how effective a particular algorithm is, and in care- 
fully specifying an algorithm to maximize efficiency. 
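
To give a feeling for what 'time proportional to the number of 
edges' means, here is a minimal Python sketch of one well-known 
linear-time method, usually credited to Hierholzer. It is not 
claimed to be the procedure of Edmonds and Johnson; the function 
name euler_trail and the edge-list representation are 
assumptions made for the illustration. 

```python
from collections import defaultdict


def euler_trail(edges):
    """Return a vertex list that traverses every edge exactly once,
    or None if no such trail exists.  edges is a list of (u, v) pairs."""
    if not edges:
        return []
    adj = defaultdict(list)               # vertex -> [(neighbor, edge id)]
    for eid, (u, v) in enumerate(edges):
        adj[u].append((v, eid))
        adj[v].append((u, eid))
    odd = [v for v in adj if len(adj[v]) % 2 == 1]
    if len(odd) not in (0, 2):            # Euler's degree condition
        return None
    start = odd[0] if odd else edges[0][0]
    used = [False] * len(edges)
    stack, trail = [start], []
    while stack:
        v = stack[-1]
        # Discard edges already traversed from their other end.
        while adj[v] and used[adj[v][-1][1]]:
            adj[v].pop()
        if adj[v]:
            w, eid = adj[v].pop()
            used[eid] = True
            stack.append(w)               # walk along an unused edge
        else:
            trail.append(stack.pop())     # dead end: emit the vertex
    if len(trail) != len(edges) + 1:      # some edge was unreachable
        return None
    return trail[::-1]


print(euler_trail([("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]))
```

Every adjacency entry is popped at most once, so the work done 
is proportional to the number of edges. Applied to the seven 
bridges of Koenigsberg themselves the sketch returns None, 
since all four land masses meet an odd number of bridges. 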


Another area of waxing interest is in determining the proper 
form of a program. Virtually unheard of five years ago, the 
term 'structured programming' has captured the fancy of the 
computing fraternity and, at this writing, is perhaps the most 
used (and abused) term in the literature's lexicon. While the 
term means many things to many people, the general idea is 
that many of the ills plaguing the software industry are 
traceable to the fact that we are incapable of properly struc- 
turing large complex tasks. While we can study the strategy 
of structuring from a language-independent point of view, many 
of the tactics in forming clear and cogent code depend on the 
particular tools at one's disposal. Hence, another purpose of 
this text is to discuss and present methods of organizing, 
i.e., structuring, SNOBOL4 programs. 


SNOBOL4 Origins


Programs written in SNOBOL4 tend to be oriented toward the 
manipulation of strings. A string is a sequence of characters 
and a character is any of the various letters, digits, 
logograms and punctuation symbols (including the blank) that 
one might punch on cards or type on 
an electronic terminal. The stream of characters you are 
reading now is an example of a string. It has, in fact, been 
subjected to some of the algorithms to be described in this 
book. 


String processing includes the testing, comparing, scanning, 
rearranging, transliterating, transforming, inserting, 
crunching, and deletion of strings. Since programs and data 
are normally entered into a digital computer in the form of 
strings and since all data printed is in this form, it might 
seem that string processing is, and always has been, in the 
forefront of computer studies. But this is hardly the case. 
Historically, string processing has been something of a step- 
child of computation. 


The computer was initially perceived as a machine whose 
primary purpose was performing numerical computations. Getting 
numbers and programs into the machine was considered inciden- 
tal to computing rather than occupying any central role. In 
fact, to program an early machine, one did not use characters 
at all, but wired up a plug board. A single program took weeks 
of effort. Humans began to realize that they were more like 
slaves to the machine than high-priests as they were forced to 
do an inordinate amount of work just to keep the machine busy. 
Alt [1972] recalls that, as early as 1947, the team of 
programmers for the ENIAC discovered a method whereby they 
could enter programs by merely dialing digits rather than 
wiring plug boards. To do this they wired the plug-board con- 
trol permanently in such a way that the machine read the 
digits and performed associated instructions in much the same 
way that a modern interpreter might do. This seems to be the 
world's first higher level language. At any rate, the machine 
slowed by a factor of five but the technique was the preferred 
one thereafter. Why? Was it because men are lazy and they 
want the machine to do all the work? Well, there is a way to 
express this less argumentatively. The machine was so success- 
ful at performing arithmetic that the bottle-neck shifted away 
from calculations with numbers to the logistics of presenting 
the problems to the machine. In many ways this problem is 
still with us. 


Peripheral devices for reading characters from paper tape and 
cards had existed for some time and it did not take long 
before such devices were attached to the machine for 
input/output. More importantly, machines were beginning to be 
designed with the stored-program concept which meant that plug 
boards did not have to be wired for each different program. 
Rather, like the trick used with the ENIAC, the machine would 
translate numbers into instructions, but with the important 
difference that the numbers did not have to be set manually. 
They could be read from some external device or they could be 
computed; in particular, they could be produced by some other 
program and the Great Age of computer languages was born. From 
this point on, the evolution of machine design gave way to an 
evolution of languages, in much the same way that human 
biological evolution has given way to a cultural evolution. 
Although the components have changed to give us cheaper, smal- 
ler, more efficient machines, the machine organization has 
remained essentially the same (the Von Neumann Machine). In 
this organization main storage consists of an aggregate of 
words each addressable by some assigned number. The data 
within this storage is entirely unstructured as seen by the 
hardware. Complex data such as strings, patterns, arrays, etc. 
are only such in the eyes of the software, not as viewed by 
the hardware. 


The first programming languages were, of course, assembly 
languages in which generally there is a one-to-one correspon- 
dence between lines in the source language and machine 
instructions. The assembler's job is essentially to translate 
from names (suitable to humans) to numbers (suitable to 
machine). This is unnatural for a machine to do and it was 
resolved essentially by a mechanism known as a symbol table 
(see Chapter 11). The use and disposition of a symbol table 
is key to the implementation and understanding of many 
programming languages in addition to assemblers. 
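
The idea of a symbol table can be sketched briefly in Python 
(again, not the language of this book). The class SymbolTable, 
the function assemble and the starting address 100 are all 
hypothetical; a real assembler does a good deal more. The 
sketch simply hands out the next free address the first time a 
name is seen and returns the same address thereafter. 

```python
class SymbolTable:
    """Maps symbolic names to numeric addresses, assigned on first use."""

    def __init__(self, first_address=0):
        self._addresses = {}
        self._next = first_address

    def lookup(self, name):
        if name not in self._addresses:      # first occurrence of the name
            self._addresses[name] = self._next
            self._next += 1
        return self._addresses[name]


def assemble(lines, table):
    """Translate (operation, name) pairs into (operation, address) pairs."""
    return [(op, table.lookup(name)) for op, name in lines]


source = [("LOAD", "Y"), ("MULTIPLY", "Z"), ("STORE", "X"), ("LOAD", "X")]
print(assemble(source, SymbolTable(first_address=100)))
```

The human-oriented names Y, Z and X are replaced by the numbers 
100, 101 and 102, and the second reference to X is translated 
to the same number as the first. 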


A rather impressive advance was made by the Fortran language 
which was developed in the mid-1950's. This language was so 
well designed that today it is perhaps the most widely used 
programming language in spite of regular denunciations by the 
academic community. Fortran opened up computation to a large 
number of programmers who would need to know nothing or very 
little of the internal organization of the machine in order to 
start programming (although they usually wind up having to 
know a great deal). Now an important point to note in connec- 
tion with Fortran is its peculiarly numerical orientation. The 
tools provided to the Fortran programmer were totally dif- 
ferent than the tools required by the system programmers who 
had to write assemblers, operating systems and the Fortran 
compiler itself. Fortran had, for example, a rich mathematical 
library containing trigonometric functions, exponentiation, 
etc. which the writers of Fortran had absolutely no need for; 
on the other hand, Fortran lacked string, character, bit and 
address data objects which are essential to 'systems' work. 
Although a step away from the numerical was made in that the 
language gave the machines the ability to accept programs in 
human style, it was assumed that the end use would be 'number 
crunching'. 


The first non-numerical language of consequence was IPL 
[Newell 1957]. This language was developed as a by-product of 
some experiments in artificial intelligence by Newell, Shaw 
and Simon in which an attempt was made to mimic the thinking 
patterns of human beings. In particular, the mental processes 
involved in theorem-proving were explored [Feigenbaum and 
Feldman 1963]. IPL is a list-processing language. All data 
is in the form of lists; the components of a list may be other 
lists or basic non-decomposable units which are actually ad- 
dresses referenced symbolically as in an assembler. Numerous 
built-in functions are available to manipulate lists. In fact, 
an IPL program is itself a list. The arch-difficulty of IPL 
is its syntax which is forbiddingly like assembly language. 


IPL was soon followed by LISP [McCarthy 1960] which overcame 
some of the syntactic difficulties of IPL. Rather than place 
components of a list vertically down the page with symbolic 
reference to sublists, LISP provided a more abbreviated 
horizontal notation with nested parenthetical expressions to 
denote sublists. Moreover, the basic nondecomposable unit, 
called the atom in LISP, was a string. In LISP, large strings 
were represented as lists of atoms, and atoms, as their name 
suggests, could not be decomposed. 


A list was the first data object whose size was not fixed for 
the duration of the program but which could vary as required. 
Lists are particularly useful in problem areas which are not 
well understood and cannot, or at least, have not been reduced 
to easily computable mathematical formulas. Hence list struc- 
tures have been a favorite form of data for artificial intel- 
ligence applications. 


COMIT is often considered the first true string processing 
language. Unlike LISP, the strings of COMIT can be arbitrarily 
manipulated not by rearranging pointers between fixed strings 
but by completely rearranging the characters (and hang the 
cost). With COMIT the string had become a data object; a 
variable (of sorts) could range over the entire set of 
strings. These variables were called 'shelves' and were 
referenced by shelf number. A very powerful process called 
pattern matching could be applied to such strings and matched 
substrings could be replaced by other strings. COMIT has one 
major deficiency; one may not use ordinary common names such 
as S, LIST, or BILL to denote variables as one might do with 
numerical variables in Fortran or even assembly language. 


The pattern matching notation entered COMIT by way of 
linguistics where the notation is quite old. The notation was 
studied in depth by Markov [1954] who treated the replacement 
operation as a fundamental algorithmic component and showed 
that all computations were possible using replacement alone. 
Languages such as COMIT and SNOBOL4 are sometimes referred to 
as Markov languages even though there is no evident historical 
connection. 
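
Computation by replacement alone can be illustrated with a 
small Python sketch in the style of a Markov algorithm. The 
function run_markov, the rule format and the step limit are 
assumptions of this illustration, not Markov's own notation. 
At each step the first rule whose left side occurs in the 
string is applied to the leftmost occurrence; the process 
stops when no rule applies or a terminal rule fires. 

```python
def run_markov(rules, text, max_steps=10_000):
    """Rewrite text using ordered rules (pattern, replacement, is_terminal).

    Patterns are assumed non-empty; max_steps guards against rule sets
    that never terminate."""
    for _ in range(max_steps):
        for pattern, replacement, is_terminal in rules:
            pos = text.find(pattern)          # leftmost occurrence
            if pos >= 0:
                text = text[:pos] + replacement + text[pos + len(pattern):]
                if is_terminal:
                    return text
                break                         # restart from the first rule
        else:
            return text                       # no rule applies: halt
    raise RuntimeError("step limit exceeded")


# Convert a binary numeral to unary strokes using replacement only.
BINARY_TO_UNARY = [
    ("|0", "0||", False),   # moving a stroke past a 0 doubles it
    ("1", "0|", False),     # a 1 is a 0 followed by one stroke
    ("0", "", False),       # leftover zeros are erased
]

print(run_markov(BINARY_TO_UNARY, "101"))
```

With the input 101 the rules leave five strokes, the unary 
form of binary 101. 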


Early work at Bell Laboratories in string processing included 
the development of a language called SCL (Symbolic Communica- 
tion Language) by Lee, et al [1962]. SCL extended the 
facilities of COMIT for string processing but had several 
deficiencies including an ungainly assembly-language syntax 
and the absence of variable names (as in COMIT). SCL had cer- 
tain unique and valuable features such as a run-time compila- 
tion and execution of strings, but its most valuable contribu- 
tion was that it provided a gestation period for SNOBOL. 


SNOBOL [Farber et al, 1964] combined two very important ideas, 
the string processing and pattern matching of COMIT and the 
symbolic referencing of variables. Thus for the first time in 
any major language (and possibly ever), a programmer could 
write: 


A = B C 


to indicate in a simple and natural way that the string B 
concatenated with the string C is to be assigned to the string 
A without disturbing the values of either B or C. The pattern 
matching operation of COMIT could be invoked in a similarly 
convenient and concise fashion. Thus for the first time, 
strings of characters could be manipulated with the notational 
ease that Fortran provided for numbers. 


Unlike Fortran, however, no simple easy translation existed 
into machine orders. On the IBM 7090, on which SNOBOL was 
first implemented, concatenation was a complex process re- 
quiring the shifting of characters through an ungainly 
accumulator. Also, the use of variables whose values cannot 
be destroyed complicates further the operation of concatena- 
tion. Thus, we cannot merely direct a pointer from B to C to 
effect the above concatenation as this would alter B. We 
cannot copy C onto the tail end of B as this would destroy 
other data. Rather, a separate section of core is allocated, 
the strings B and C are copied in, and a pointer is directed 
from A to the new storage. Since storage is being generated 
continuously, a process of storage recovery (garbage collec- 
tion) is required. Thus, the apparent simplicity requires a 
rather considerable software system to support it. It is not 
surprising that it appeared relatively late on the programming 
scene. 


SNOBOL's successors, SNOBOL3 [Farber et al 1966] and SNOBOL4 
[Griswold et al 1968], while retaining the simple and powerful 
notation of the original SNOBOL, greatly extended and 
generalized its facilities. In fact, it is no longer accurate 
to characterize SNOBOL4 as a string language, since its 
facilities extend considerably beyond string manipulation. 


The Future


How well may we expect SNOBOL4 to fare in the future? 
Certainly, this is an intriguing question to ask of any 
language and one which is extremely difficult to answer. To a 
first approximation, the success of the language will depend 
on the future importance of nonnumeric data 
processing. Although numerical programming will doubtlessly 
increase in the future, non-numerical processing should 
increase even faster. This is due to the economics of the 
situation. A computer can multiply two 8-digit numbers 
together in approximately 6 microseconds whereas it takes a 
human about 60 seconds. The computer is therefore 10^7 times 
(or 7 orders of magnitude) faster at this activity than 
humans. On the other hand, to take a typical string-processing 
problem, a computer, carefully programmed, will require about 
two milliseconds to scan a paragraph containing 1000 characters 
for some string such as 'ALPHA', whereas a human will require 
approximately 20 seconds. Hence, the machine for the non- 
numeric problem is only 10^4 times (or 4 orders of magnitude) 
faster than the human. Hence, the machine is better at numerical 
processing by about 3 orders of magnitude. Since historically 
computers have been much more expensive than humans it is un- 
derstandable that they have been applied mostly in those areas 
with a strong arithmetic flavor. 
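
The arithmetic behind these figures can be checked with a few 
lines of Python. The timings are the ones quoted in the text; 
the helper speed_ratio and the variable names are hypothetical 
and serve only the illustration. 

```python
MICROSECONDS_PER_SECOND = 1_000_000


def speed_ratio(human_seconds, machine_microseconds):
    """How many times faster the machine is than the human."""
    return human_seconds * MICROSECONDS_PER_SECOND // machine_microseconds


# Multiplying two 8-digit numbers: 60 seconds versus 6 microseconds.
multiply_ratio = speed_ratio(60, 6)

# Scanning 1000 characters for 'ALPHA': 20 seconds versus 2 milliseconds.
scan_ratio = speed_ratio(20, 2_000)

# Relative advantage of the numerical task over the string task.
advantage = multiply_ratio // scan_ratio

print(multiply_ratio, scan_ratio, advantage)
```

The three values are 10^7, 10^4 and 10^3, that is, 7, 4 and 3 
orders of magnitude respectively. 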


Another factor to consider in comparing the two kinds of 
processing is input/output (i/o). Two numbers that are mul- 
tiplied together typically do not come from typed data but are 
the result of other computations within the machine. But the 
string that is being scanned for the word 'ALPHA' has 
generally entered the machine from some i/o device such as 
disk, tape or terminal. If we consider disk as typical we find 
that this device transmits 10,000 characters in a total time 
of about 100 milliseconds so that our paragraph to be scanned 
requires 10 milliseconds. Multi-programming operating systems 
help somewhat to alleviate the problems of delay time due to 
disk i/o by transferring control to another resident program 
while i/o is in progress but the program doing i/o must remain 
resident in main storage thereby consuming resources. If we 
add a factor for the inefficiency of the transfer of control 
process and the time expended in transporting the characters 
from the main storage receiving stations (i/o buffers) into 
work areas we arrive at a figure very much like ten mil- 
liseconds anyway. The net effect is that if the string to be 
scanned is also read and written we increase the cost of 
string processing by another order of magnitude. 


Another difficulty with string processing that has helped hin- 
der its more rapid development is that string operations are 
by no means standardized at the machine level. Thus, string 
processing is not only slower, it is more complicated. In 
Fortran, the statement: 


X = Y * Z 


results in three instructions, LOAD Y, MULTIPLY by Z, and 
STORE into X. No such corresponding instruction sequence can 
be produced for typical SNOBOL4 operations such as pattern 
matching or concatenation. Not only do these operations re- 
quire more instructions but the methods vary from machine to 
machine. To begin with, the method of representing strings 
varies [Madnick 1967]. Representational decisions such as 
whether to store one character per word or several characters 
per word may depend on machine characteristics such as whether 
characters are directly addressable. Another important dif- 
ference is how string values are bound (assigned) to 
variables. For example, in PL/I the only very efficient string 
representation is to allocate a given storage area of maximum 
size for each string variable. On the other hand, an implemen- 
tation of the SNOBOL4 language requires that a pointer be 
associated with each variable which points to the actual 
characters. This may seem like a minor difference but it is 
not; in the PL/I approach a simple string assignment such as: 


results in copying the string. In SNOBOL4, only the address 
is copied. However, the latter method implies the necessity 
to garbage collect whereas the former does not. That is, if 
S1's pointer is overwritten by another pointer, the old string 
pointed to by S1 may no longer be needed. Experience shows 
that we cannot afford the luxury of retaining every string 
ever referenced in a string-processing application, and so, 
obsolete strings must be discarded. 


Even fixing on a common data representation, the method of 
scanning a string S for a substring, say 'ALPHA', can vary 
considerably. The IBM 360/370 contains a TRT* instruction 
which enables the machine to quickly scan a string for one of 
a set of characters. Thus, we might rapidly scan the string S 
for the lead character 'A' thus increasing the scan rate. But 
time is required to set up this rapid scanning. For short 
strings or for strings containing many A's it would be more 
economical not to use this special scan. Even given the rapid 
scan ability, it is not clear that 'A' should be the character 
searched for. If we assume that P's occur less frequently than 
A's then a rapid scan for the letter 'P' should be made. Given 
any such 'P' we can then check for the characters 'AL' 
directly before and 'HA' directly after. 
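
The strategy of scanning for a rare character and then checking 
its neighbors can be sketched in Python. Here the built-in 
str.find merely stands in for a fast hardware scan such as TRT; 
the function find_via_rare_char and its rare_index parameter 
are hypothetical names introduced for the illustration. 

```python
def find_via_rare_char(subject, pattern, rare_index):
    """Locate pattern in subject by first scanning for the single
    character pattern[rare_index], then checking what surrounds it.
    Returns the index of the leftmost match, or -1."""
    rare = pattern[rare_index]
    before = pattern[:rare_index]          # e.g. 'AL' for 'ALPHA', index 2
    after = pattern[rare_index + 1:]       # e.g. 'HA'
    # The rare character cannot occur earlier than rare_index in a match.
    pos = subject.find(rare, rare_index)
    while pos != -1:
        start = pos - rare_index
        if (subject.startswith(before, start)
                and subject.startswith(after, pos + 1)):
            return start
        pos = subject.find(rare, pos + 1)  # resume the fast scan
    return -1


print(find_via_rare_char("APPLE ALPS ALPHA", "ALPHA", 2))
```

Which character of the pattern to scan for is a guess about the 
data; the sketch leaves that choice to the caller through 
rare_index. 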


The setup tradeoff is not unique to the 360/370 architecture. 
For many machines a fast inner loop can be written to test for 
a specific character that will be faster than a loop to test 
for an arbitrary character (which is, say, in a register). If 
one is willing to invest time in forming characterizations of 
the subject string (the string being scanned) one can perform 
a kind of hash test [Harrison 1971] which is very fast. This 
is inefficient, however, unless the subject string will be 
scanned repeatedly. 
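
A rough Python sketch of such a characterization follows. It is 
only loosely in the spirit of the hashing approach cited above 
and is not a reproduction of it: the functions signature and 
may_contain, the 64-bit width and the hash of adjacent 
character pairs are all assumptions of this illustration. The 
subject string is summarized once as a bit mask; a pattern 
whose mask has a bit the subject lacks cannot occur in it. 

```python
def signature(text, bits=64):
    """Summarize text as a bit mask with one bit per hashed pair of
    adjacent characters."""
    sig = 0
    for a, b in zip(text, text[1:]):
        sig |= 1 << ((ord(a) * 31 + ord(b)) % bits)
    return sig


def may_contain(subject_sig, pattern, bits=64):
    """False means the pattern is certainly absent from the subject;
    True means a full scan is still needed to decide."""
    pattern_sig = signature(pattern, bits)
    return (subject_sig & pattern_sig) == pattern_sig


subject_sig = signature("THE ALPHABET")     # computed once, reused often
print(may_contain(subject_sig, "ALPHA"), may_contain(subject_sig, "XYZ"))
```

The test itself is a single mask comparison, but forming the 
signature requires a complete pass over the subject, which is 
why the investment pays only when the same subject is scanned 
many times. 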


The complexity involved in specifying string algorithms 
becomes significant in several ways. The languages for string 
processing must call functions rather than compile in-line 
code and the linkage overhead further slows down computation. 
In fact, most implementations tend to be interpretive which 
greatly reduces the speed of numerical operations if, for sim 
plicity, these are also treated interpretively. Complex 
language processors cannot be built as rapidly and any string 


E 


*TRT stands for TRanslate and Test. This is a misnomer; 'Scan 
and Test' would have been better. 
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language will experience more difficulty in being reproduced 
On some other machine. When a processor, such as the macro 
implementation of SNOBOL4Y, attempts to be machine-independent, 
it must sacrifice efficiency significantly. For example, the 
macro implementation of SNOBOLU will scan a string for a sub- 
String at the rate of 40 microseconds per character (on the 
IBM 360/Mod 65) a full order of magnitude slower than is 
possible on that machine essentially because of its machine 
independence. The most efficient utilization of any machine 
for typical string operations requires in general a complete 
restructuring of the program and this tends to inhibit the 
rapid spread of any language. 


The complexity issue becomes important when one realizes that 
the very great strides in producing economical computation in 
the last several years have come in the form of minicomputers 
and microcomputers. These machines tend to be small, new and, 
as is characteristic of a new industry, exhibit a relatively 
large number of different designs. All three factors tend to 
work against a large ambitious SNOBOL-like language. 


As the early ENIAC programmers discovered, however, very few 
problems are so purely numerical that the machine can be 
casually fed problems and spew out answers. In fact, most of 
what mankind wants done is non-numerical and is difficult if 
not impossible to program. By contrast, those problems which 
are very numerical have probably already been programmed or 
are embedded so intricately in an essentially non-numerical 
setting that the numerical part can't be brought easily to the 
machine. To consider just one example, the filling out of
one's income tax can be done conversationally from a computer 
terminal; the amount of computation that must be performed is 
insignificant compared to the total programming required to 
make the system usable by the 'unwashed' (naive) user. Hence, 
if we are to extend the application of computers to new areas 
there will surely be much about these areas that is non- 
numerical. 


SNOBOL4 Implementations


SNOBOL4 was developed during a period of computer changeover
at Bell Laboratories and so the language was written in a
system of macros [Griswold 1972]. In this way, the language
could relatively easily be transported to the new machine
(whatever it was going to be). This had the fortunate
consequence of making SNOBOL4 transferable to other machines
with far less difficulty and with much greater faithfulness to
the original design than would otherwise have been possible.
This implementation is usually referred to as the MAcro
Implementation of SNOBOL4; we will refer to it throughout as
MAINBOL.


While MAINBOL is relatively portable, it is also inefficient. 
This is due primarily to its machine independence. A fair 
estimate of the cost of machine independence in the case of
SNOBOL4 is a factor of two in both space and time.


SPITBOL [Dewar 1971] was developed to overcome the
inefficiencies of SNOBOL4, at least for the IBM 360. By writing
exclusively in assembly language, by developing new techniques
for string handling and storage management, and by compiling
executable code rather than running interpretively, SPITBOL
was able to better the running speed of MAINBOL by a factor of
7 (this was the median figure for 21 programs tested at Bell
Laboratories). SPITBOL is also smaller than MAINBOL by a
factor of two. It should also be pointed out that SPITBOL not
only did not compromise with the language, which so often
happens when a language is reimplemented from scratch, but
actually extended the language in several significant ways.


The SITBOL processor [Gimpel 1973a & 1974] is a completely new
implementation of the SNOBOL4 language for the PDP-10. SITBOL
benefited greatly from the SPITBOL experience, using and
improving upon the implementation innovations of SPITBOL.
Although SITBOL is an interpreter, it is faster than MAINBOL
by a factor of 3 to 5 and is smaller by a factor of 3.
SITBOL is upward compatible with both SNOBOL4 and SPITBOL and
contains many language enhancements as well. These three
implementations are discussed more fully in Chapter 11.


While these are the only implementations that can claim to
support a full SNOBOL4, the FASBOL implementation [Santos
1971] should also be mentioned. This ambitious project is
intended to produce a compiler for SNOBOL4 that, in addition to
obtaining high speed, supports separate subroutine
compilation, compiled patterns and in-line arithmetic. FASBOL,
however, lacks several SNOBOL4 features and many of the
programs in this book will therefore not run under that
system.


SNOBOL4 foibles


Winston Churchill's famous statement about democracy can be
made with particular aptness about SNOBOL4. It is the worst
of all programming languages, except for all the rest. By this
we mean that SNOBOL4 is a very effective programming language
not because it is free of blemish (it actually has quite a
few) but because of the many valuable features which it does
have. In my own experience, unless the problem is totally
numerical, a SNOBOL4 program will be at most half as large as
one written in some other language to achieve the same effect.
In some cases the reduction in size and complexity is indeed
dramatic. SNOBOL4 achieves this code condensation by providing
a number of facilities simply not available in most other
languages. These include pattern matching which is so rich as
to amount to a language within a language. The storage
allocation facility, while conceptually simple, completely
frees the user from concern over the detailed disposition of
data objects. All data objects are represented by a descriptor
of fixed size. This makes it possible to have heterogeneous
arrays, declaration-free variables and structures, and, most
importantly, it allows data objects to be freely transferred
between calling and called functions. The historic tendency of
interpreters to include symbol tables during execution leads
to a number of facilities not normally available. These
include indirect referencing, indirect goto's, dynamic
definition of functions and structures and, the ultimate
source of freedom and flexibility, the ability to compile and
execute arbitrary strings. The language has a comprehensive
tracing and error recovery facility and the ability, through
numerous keywords, to provide the user with all sorts of
information concerning his running program.


In general, the power and flexibility of SNOBOL4 are
unequaled. While the language can be abused, as many languages
can be, it has many features which, properly employed, enable
large programs to be written with a minimum of difficulty.


This is not to suggest that the language is entirely free of
defect. As in any ambitious project of SNOBOL4's magnitude,
there are many minor deficiencies. Moreover, merely knowing
about them does the language designer no good. Liabilities
get 'frozen' into a language since it is impolitic to make
non-compatible changes. For casual SNOBOL4 programming we may
ignore many of these deficiencies. When composing large
programs, however, it is much more important to develop a
systematic approach and we must confront these defects
squarely.


As remarked by Dunn [1973], a language which is very
inefficient can be a burden to use even though the application,
such as bootstrapping, is not nominally one demanding high
efficiency. Dunn was critical of SNOBOL4 in this regard but his
remarks were actually directed to a specific implementation,
MAINBOL. As Hanson [1973] remarks, the inefficiencies noted
in using MAINBOL do not apply to SPITBOL and SITBOL. Our
remarks in this critique will be directed only to the SNOBOL4
language as described by Griswold et al [1971] and not to any
particular implementation.


1. Perhaps the most noted deficiency of SNOBOL4, especially
in an age when the goto is harangued daily, is the lack of
good control structures. They are admittedly primitive
[Griswold 1974]. There is no IF ... THEN ... ELSE, and no
repetition element such as the Fortran DO. One is forced to
use many goto's and to invent unique label names. This is a
bother and conventions must be adopted. It is not, however,
as detrimental to good programming practice as one might
think, since it encourages reliance on the function, which is
a superior control structure anyway. See the remarks on
Structured Programming.


2. A number of difficulties involve pattern matching. Pattern
matching is a complex process and, to be used fully, requires
a comprehensive understanding on the part of the user. For
this reason two chapters in this book are devoted to a
theoretical and practical treatment of the subject. But aside
from the learning problem there are residual difficulties. One
of these is the one-character assumption, which we discuss
more fully in Chapter 7. The statement below:


HERE   S  LEN(1) $ C  LEN(1) $ D  *LGT(C,D)  =  D C     :S(HERE)


should sort the string S as it repeatedly swaps any
consecutive pair of characters not in the correct lexicographic
order. Unfortunately, if the last two characters are out of
order they are never swapped. The pattern matching mechanism
assumes that *LGT(C,D) matches at least one character, that
the entire pattern therefore requires at least three
characters, and that it would be a waste of time to try the
pattern on merely two characters. The manual will say to use
FULLSCAN mode to circumvent this but, as we will argue later,
mode switching is not good practice for large programs.
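
The effect of the one-character assumption can be imitated outside SNOBOL4. The following Python sketch is only a model under our own assumptions: the name sort_by_swaps and the parameter assumed_min_len are hypothetical, and the scanner is reduced to the single rule that a cursor position is tried only if the pattern is believed to fit in what remains of the subject.

```python
def sort_by_swaps(s, assumed_min_len=2):
    """Sort s by repeatedly swapping the first out-of-order adjacent pair.

    assumed_min_len is the length the scanner believes the pattern needs:
    2 is the true requirement; 3 models the one-character assumption,
    under which *LGT(C,D) is thought to consume a character.
    """
    chars = list(s)
    swapped = True
    while swapped:                        # corresponds to :S(HERE)
        swapped = False
        # Positions too close to the end are never tried by the scanner.
        for i in range(len(chars) - assumed_min_len + 1):
            c, d = chars[i], chars[i + 1]
            if c > d:                     # the predicate LGT(C,D)
                chars[i], chars[i + 1] = d, c
                swapped = True
                break                     # one replacement per execution
    return "".join(chars)
```

With the default setting 'DCBA' becomes 'ABCD'; with assumed_min_len=3 the final pair is never examined and the result is 'BCDA'.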


Predicates may be employed within patterns in spite of the 
one-character assumption if one employs a trick. See Prog. 
8.7. 


3. Another heuristic that gives problems is the length- 
failure, or futility heuristic. Under this assumption, the 
very natural back-referencing operation becomes virtually 
unusable. For example, the pattern matching statement: 


       S  LEN(3) $ X  ARB  *X


examines the string S for a pair of identical three-character 
substrings, if it would only work. The first three characters 
of S are assigned to X and this string is searched for in the 
remainder of S. Upon failing, the next three characters of S 
should be assigned to X and the search should continue. This 
will not happen, however. When *X does not match by reason 
that there are insufficient characters remaining in S, it 
signals 'length failure' or 'futility' (see Chapter 7 for a
more detailed discussion of these terms). The scanner believes 
that it can immediately halt all processing and so it does. 
The result is that, unless the first of the pair of three- 
character strings begins with the first character, the pattern 
fails. The error can be cured by FULLSCAN. As indicated in 
the preceding paragraph, however, this introduces other 
problems. 
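
For comparison, the intended back-referencing search and its failure can be modelled with a regular expression. This is a Python sketch, not SNOBOL4, and it rests on our own simplification: the futility heuristic is represented only by restricting the match to the first cursor position, which is the net result described above. Both function names are hypothetical.

```python
import re

# (.{3}) plays the part of LEN(3) $ X, .*? the part of ARB, and \1 of *X.
_REPEAT = re.compile(r"(.{3}).*?\1", re.DOTALL)

def find_repeat(s):
    """Return the first three-character substring that recurs later in s,
    trying every starting position (the behaviour one would like)."""
    m = _REPEAT.search(s)
    return m.group(1) if m else None

def find_repeat_first_position_only(s):
    """Model of the observed behaviour: once the attempt at the first
    character fails for lack of remaining characters, scanning stops."""
    m = _REPEAT.match(s)
    return m.group(1) if m else None
```

For the subject 'xxABCyyABCzz' the first function returns 'ABC' while the second returns None; for 'ABCyyABC' both return 'ABC'.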


4. Pattern building, as distinct from matching, also causes 
some problems. The pattern matching statement: 


S LEN(N) . K = 


removes the first N characters from the string S and assigns 
them to the variable K. Unfortunately, the pattern must be 
constructed each time the statement is executed. The cost of 
building the pattern with the concomitant garbage collection 
will require more time than the pattern match itself. A
solution is


P = LEN(*N) . K 


S P = 


Although this can serve to remove the pattern-building
operation from the 'inner loop', it creates several other
problems. One has to think up a unique name (P just won't do
in a large program). The statement bearing the pattern
definition is separated from the statement bearing the match.
This can cause difficulties when trying to decipher a large
program. The side-effect of setting the variable K without any
apparent indication at the pattern match is poor practice.
Finally, the use of *N is awkward. The novice tends to overuse
the deferred expression and begins to use it where it produces
errors. In short, the language becomes more confusing, more
difficult to learn and more error prone.


5. It should be possible in any language to write a function
whose behavior will be invariant with respect to its
environment. The language that comes closest to this ideal is
Fortran with its separately compiled subprogram. SNOBOL4 tends
to be worse than others in this respect. For example, the
function ROT(S), below, will return its string argument
rotated one character to the right.


       DEFINE('ROT(S)T')                      :(ROT_END)
ROT    S  RPOS(1)  LEN(1) . T  =
       ROT  =  T S                            :(RETURN)
ROT_END


This function will behave properly provided (1) LEN, RPOS, 
binary '.' and concatenation have not been redefined, (2) 
RETURN has not been redefined, (3) the &ANCHOR mode has not 
been set, (4) ROT is not used as a label outside the program,
and (5) neither ROT, S nor T have been I/O associated. 
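
As a point of reference, the effect that ROT is meant to have can be stated in a single line of Python. This is only a restatement of the intended result for readers who wish to check it by machine; the lower-case name rot is ours and the sketch says nothing about the environmental hazards listed above.

```python
def rot(s):
    """Rotate s one character to the right, e.g. 'ABCD' becomes 'DABC'.

    s[-1:] is the last character (or the empty string when s is empty),
    mirroring RPOS(1) LEN(1) . T; s[:-1] is what remains of S after the
    replacement.
    """
    return s[-1:] + s[:-1]
```

Applying rot as many times as there are characters in a string returns the original string.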


6. SNOBOL4 contains no block structure so that problems of 
scope emerge. For example, the function INC(NAME), defined 
below, will increment the named variable. Also, COUNT will 
record the number of times the function was called. 


       DEFINE('INC(NAME)')                    :(INC_END)
INC    COUNT  =  COUNT + 1
       $NAME  =  $NAME + 1                    :(RETURN)
INC_END


If COUNT is used outside the function, its current value can 
be destroyed. That is, there is no way to isolate this use of 
COUNT from any other that might exist in a program. One may 
designate that COUNT is local (a misnomer; 'temporary' would
be better) to the function. But this would mean that the value
of COUNT would be saved before entering the function and
restored on return and hence could not be used to count the
number of calls.


The named variable being incremented by INC may not be ar- 
bitrary. If it were COUNT, then it will be incremented twice. 
If it were INC, then it would be incremented once, but on 
return its old value would be restored. If it were NAME, there 
would be an attempt to add 1 to the string 'NAME' resulting in 
a fatal error. 
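
These three collisions can be reproduced with a toy model of a single global name space. The Python sketch below is an illustration under our own assumptions and not a description of any SNOBOL4 implementation: the class Env, its methods, and the use of ValueError to stand for the fatal error are all hypothetical. It models only the saving and restoring of the function name and the formal argument around a call.

```python
class Env:
    """A toy model of one global name space shared by all functions."""

    def __init__(self):
        self.vars = {}

    def get(self, name):
        # Variables never assigned hold the null string.
        return self.vars.get(name, "")

    @staticmethod
    def num(value):
        # The null string counts as zero; a non-numeric string is an error.
        return 0 if value == "" else int(value)

    def inc(self, arg):
        # Function entry: save INC and NAME, then bind them afresh.
        saved = {n: self.get(n) for n in ("INC", "NAME")}
        self.vars["INC"] = ""
        self.vars["NAME"] = arg
        try:
            self.vars["COUNT"] = self.num(self.get("COUNT")) + 1
            target = self.get("NAME")                  # $NAME
            self.vars[target] = self.num(self.get(target)) + 1
            return self.get("INC")                     # value of the call
        finally:
            # Function return: the saved values are restored.
            self.vars.update(saved)
```

Calling inc('COUNT') on a fresh Env leaves COUNT at 2; calling inc('INC') returns 1 but leaves INC as it was; calling inc('NAME') raises ValueError because the string 'NAME' is not a number.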


7. Function definition is unusually flexible in SNOBOL4, but,
as has been noted by Abrahams [1974], it also leads to
difficulties. Since function definition is dynamic, the DEFINE
must be executed; but where should it be placed? If the DEFINE
is placed in some initialization section separated from the
body of the function by some distance, programs become
difficult to follow. To place the DEFINE adjacent to the body
of the function, which is good practice, it is necessary to use
a hop-around construct as we have done above with ROT(S) and
INC(NAME). But this is troublesome and wastes space.
Execution-time space is required for: (1) the string bearing
the function prototype, (2) the code required for the DEFINE,
the hop-around and the target of the hop, and (3) the string
bearing the hop-around label. The third item above is
explained more fully below.


8. By means of the indirect goto it is possible to do a
multi-way branch. For example:


       :($TRIM(INPUT))


will read a label and go to it. But this requires that every 
label must be in the symbol table at run-time. Not only must 
the physical characters of each label be present but an amount 
of additional storage to house other data associated with a 
name. This additional information averages about 32 characters 
across several implementations. A 40-character storage penalty 
for each label is considerable for large programs. 
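
The multi-way branch has a familiar counterpart in languages without an indirect goto, namely a table that maps names to routines. The Python sketch below is offered only as an analogy; the handler names START and STOP, the table HANDLERS and the function branch are invented for the illustration. It also makes the storage point visible: every name that may be branched to must exist as a string at run time.

```python
def start():
    return "starting"

def stop():
    return "stopping"

# Every reachable label must be present, as a string, in this table.
HANDLERS = {"START": start, "STOP": stop}

def branch(line, handlers=HANDLERS):
    """Read a label from an input line and transfer to its handler."""
    label = line.rstrip(" ")              # TRIM removes trailing blanks
    if label not in handlers:
        raise KeyError("undefined label: " + label)
    return handlers[label]()
```

Here branch('STOP   ') returns 'stopping', while an unknown label raises KeyError, roughly as a branch to an undefined label would halt a SNOBOL4 program.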


9. In SNOBOL4, INPUT/OUTPUT is markedly clean and uncluttered;
but it generally lacks facilities. If one is only transmitting
strings to sequential files, SNOBOL4 is adequate. However, no
special facilities exist for printing columns of numbers or
for doing direct-access I/O. Output media intended for human
viewing are really two dimensional and merely outputting
strings is inadequate. Although an extension to the language
was made in this regard [Gimpel 1972a], space limitations have
excluded it from most implementations.


10. The statement 


results in a strange error. One must write '0.1', not '.1',
because unary '.' is an operator, which should be applied to a
variable, not a value such as 1.


11. There are several precedence anomalies. In virtually all
programming languages, the operators '/' and '*' have the same
precedence and associate to the left. In SNOBOL4, '*' has a
higher precedence than '/'.


The precedence of concatenation is one of the lowest whereas 
it should be one of the highest. Thus, 


       A B + C


is parsed as A (B + C).


The two highest precedence binary operators, viz. '¬' and '?',
associate differently. The first associates to the right and
the second associates to the left. What is one then to make
of:


       A ¬ B ? C


12. SNOBOL4 usurps the characters '<' and '>' for bracketing,
which renders them unusable as operators. This means one must
use the relatively primitive GT(X,Y), GE(X,Y), etc. But
square brackets are available, at least in ASCII, for the
purpose and these are unused.


13. The use of a blank to denote concatenation seems to force
the language to require surrounding binary operators with
blanks. Thus, it is a mistake in SNOBOL4 to write 'A+B'; one
must write 'A + B'. This causes learning problems.


The blank operator also requires placing a function call adja- 
cent to its arguments. A common mistake for beginners, for 
example, is to write: 


TRIM (INPUT) 


and wonder why the TRIM function didn't work. No error can be
signalled for this sequence, of course, which dutifully
prepends to the input the current value of the variable TRIM,
which is probably null.


14. To compound the learning difficulties, the blank binary
operator is also used to denote pattern matching. If one is
teaching SNOBOL4 one must explain why the sixth blank below
denotes pattern matching while the others denote
concatenation.


       ((A B C) A B C) A B C
15. While SNOBOL4 is more than just a string language, the
facilities of the language are geared much more for string
processing than any other kind. For example, although SNOBOL4
contains arrays there is no way to automatically sequence
through an array as one can by pattern matching a string or as
is possible with APL. Worse, SNOBOL4 does not even contain a
conventional repetition element like the DO-loop. Also, the
tracing facilities, while quite useful for strings, yield
little information with arrays. When accessing strings to do
fairly complex activities one does not mind paying a small
interpretive overhead since this is a relatively small part of
the overall computation. But the interpretive overhead of
array processing can be several times the cost of accessing
the array. The net result is that although SNOBOL4 contains
arrays, it is not very good at processing them. One is much
better off in some other language. Similar remarks may be made
with perhaps less force about the programmer-defined datatype.


16. There is some language clutter which could be removed.
In particular &TRIM, &INPUT and &OUTPUT were introduced into
the language to overcome implementation inefficiencies of
MAINBOL. The &ANCHOR keyword invites unstructured programming
and should be abolished. The VALUE function was a nice idea
but was defined incorrectly and, in its current form, is
useless. I know of no serious uses of the SUCCEED pattern but,
if needed, one could use ARBNO(NULL) were it not for the fact
that SNOBOL4 attempts to 'protect' you from having a null
argument to ARBNO.


17. Although essential for some applications, FENCE and ABORT
are difficult to learn and use and do not compound very well.
A NOT function would have been better. See Chapters 6-8 in
this respect.


It is hoped that the reader has not by now come to the conclu- 
sion that SNOBOL4 is an utter abomination. With care and 
foresight many of these deficiencies can not only be overcome 
but turned to advantage. We will see ample evidence of this 
in this and the remaining chapters. It is also the writer's 
hope that this catalog of defects can serve to dispel the no- 
tion that a recognition of a language's strengths is tan- 
tamount to being in love with the language and hence blind to 
its flaws. (This happens frequently but it is not a universal 
phenomenon.) 


Having thusly disposed of the bath water, and assuming that we 
still have our baby, we may proceed to the important topic of: 


Structured Programming


An unsophisticated programmer, in a surge of programming
frenzy, will write a large program straight-out over several
pages which will exhibit no evidence of structure. Such
programs generally prove to be bitterly difficult to debug and
modify. Dijkstra [1968] cited the overuse of the goto as one
of the most flagrant abuses in such run-on programs.
Willy-nilly transfers of control from one program segment to
another result in a tangle of spaghetti-like confusion. In
fact, the abuse has become so great that a controversy has
arisen over whether the goto


It is this writer's contention that improper use of the goto 
is a symptom rather than a cause of poor structuring. To 
properly structure a large program it must be decomposed into 
smaller subroutines (or, equivalently, functions, procedures, 
etc.). Subroutinizing reduces the overall size of a program 
since the same section of code may be referred to by several 
different statements. It also allows greater flexibility in 
the writing of a program since it is often unclear at the 
start where an important subactivity will be needed. But the
most important aspect of subroutinizing is the structure with
which it endows the overall program. With reasonably
well-defined interfaces between subroutines, the complexity of
a large program becomes merely the sum of the complexity of the
individual component routines, not the product or some higher
order function. Under such circumstances, the subroutine call 
becomes the primary method of inter-routine transfers of con- 
trol. Intra-routine transfers of control can quite comfortably 
be made with the goto. In fact, many algorithms described in 
a half dozen or so English statements use the goto as a means 
of making more precise that which might otherwise be am- 
biguous. Far from being inherently evil, the goto is a power- 
ful, and the most basic, control element. It is perhaps 
because of this power that it can so easily be abused. 


But whereas we may elect to keep the goto as a control element 
of last resort, it is not generally the best control structure 
for all circumstances. In particular, the IF ... THEN ... ELSE 
... sequence as well as a repetition structure (such as the
Fortran DO) are ideal in many instances. Their absence in
SNOBOL4 has led some critics to be unkind to the language. To
a certain extent the deficiency is real, but is ameliorated 
considerably by what may be called the implicit iteration of 
pattern matching. Thus, the statement: 


       S  ' '  =


which removes the first blank from the string S contains an 
implicit iteration over the characters of the string S. The 
result is a statement which is considerably easier to under- 
stand than an explicit sequencing. Thus the reason for the 
lack of conventional control structures in SNOBOL4 is that the
need for them is not felt so acutely. As confirmation of this 
supposition, APL, with its many forms of implicit array itera- 
tion, also lacks the standard control structures (other than 
the goto). 
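
The contrast between explicit sequencing and implicit iteration can be shown in any language that offers both. The following Python sketch is only an analogy to the SNOBOL4 statement discussed above; both function names are ours.

```python
def remove_first_blank_explicit(s):
    """Explicit sequencing: step through the characters one at a time."""
    for i, ch in enumerate(s):
        if ch == " ":
            return s[:i] + s[i + 1:]
    return s                              # no blank present

def remove_first_blank_implicit(s):
    """Implicit iteration: the search over the characters of s is hidden
    inside a single operation, as it is in the pattern match."""
    return s.replace(" ", "", 1)
```

The two functions give the same result for every string; the second, like the pattern match, leaves the loop to the language.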


It would not be correct to conclude that to write large 
programs in SNOBOL4 we subroutinize everything in sight and 
let it go at that. Certain conventions must be followed with 
respect to names of labels, global variables, keywords, etc. 
so that separately written subroutines can co-exist comfor- 
tably. A system of conventions of this kind is followed in 
writing the individual functions in this book so that they
indeed can be joined together without mutually interfering with
each other. Many of the routines, in fact, call each other 
and the text processor which produced this book is a rather 
large assemblage (over 3000 statements) of functions which in 
some cases are identical to routines described and in all 
cases were written according to the conventions advocated. 


Conventions


In order to write well-structured programs in SNOBOL4 it is
rather more important to establish a system of conventions
than in other languages. This is because the language does not
support separately-compiled functions and hence there is a
potential problem with name conflicts. Another problem has to
do with mode switches. For example, if we write a function
which uses pattern matching, we are not generally free to set
the mode of &ANCHOR. To do so would set the mode of &ANCHOR
for the calling routine. But how can the called function know
which setting exists for the &ANCHOR switch? There are only
two ways out of this dilemma: either the called routine saves
the old value of &ANCHOR, assigns it a new value, and restores
the old value before returning, or it makes an assumption as
to what its value will be and all routines live by that
assumption. The first method is clearly too awkward and is
made more odious by the thought that we would have to do the
same for &FULLSCAN as well. Hence, our routines will assume
these keywords to contain certain values. There are perhaps
good reasons to always assume &ANCHOR to be on and/or to assume
&FULLSCAN to be on, but we will abide by the convention that
they always have their default value of 0 (off).
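
The first way out, saving and restoring the switch, can be sketched in Python to show what each routine would have to do. Everything in the sketch is hypothetical: the table KEYWORDS stands in for the keyword settings, match is a toy matcher that honours only the ANCHOR switch, and starts_with is an invented routine that needs anchored matching.

```python
from contextlib import contextmanager

# Stand-in for the global keyword settings; 0 means off, as by default.
KEYWORDS = {"ANCHOR": 0, "FULLSCAN": 0}

@contextmanager
def keyword(name, value):
    """Save the old value of a switch, set a new one, restore on exit."""
    old = KEYWORDS[name]
    KEYWORDS[name] = value
    try:
        yield
    finally:
        KEYWORDS[name] = old

def match(subject, literal):
    """Toy matcher: anchored matching looks only at the first position."""
    if KEYWORDS["ANCHOR"]:
        return subject.startswith(literal)
    return literal in subject

def starts_with(subject, literal):
    # The routine must bracket its own matching with save and restore.
    with keyword("ANCHOR", 1):
        return match(subject, literal)
```

Even in this small form the bookkeeping has to be repeated in every routine that matches, and once more for each additional switch, which is the awkwardness the convention of fixed default values avoids.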


It is possible to vary the value of variables having preas- 
signed (pattern) values such as ARB, BAL, FAIL, etc. However, 
it should be obvious that it is poor practice to change these 
values for normal programming. The only exception may be to 
modify ARB (and other patterns) in an upward compatible way
for debugging purposes. For example, if we set: 


ARB = ARB $ OUTPUT 


at the beginning of the program then every string matched by 
ARB will be printed. Since such a modification only produces 
an upward compatible side-effect, and since the change is only 
temporary, no ill can come of it. 


It is also poor practice to redefine built-in operators and 
functions unless this is done in an upward compatible manner.
For example, since the SIZE function is not pre-defined for 
array arguments it is not necessarily poor practice to 
redefine the SIZE function so that if the argument is an array 
it will return the number of elements in the array (a function 
which is very possible to write in SNOBOL4). On the other hand 
to redefine SIZE where it is already defined is to produce the 
sort of global change in the language which makes 
subroutinizing difficult. 


How should names be kept separate to avoid collision?
Conflicts can occur with names of functions, variables, and
labels. Since the number of functions is relatively small (a
few hundred at most) there is generally no problem here. The
names of functions in this book were generally chosen after
English words and if this is the case conflicts are readily
apparent.


Variable-name conflicts could be a severe problem if one does
not subroutinize. If one does, the problem virtually
disappears. One simply designates the variables to be temporary
to some given procedure. If the functions are kept short enough
no problems arise. It's occasionally necessary to use global
variables. Here potential conflicts can arise unless one is
careful. We will use the general policy of designating such
global names with a name bearing one of the special characters
'.' or '_'. This tends to reduce the possibility of collision.


We will typically use the '.' in a pattern name to suggest 
that a variable is being assigned a value. Thus we may write: 


LEN1.T = LEN(1) . T 


and the name becomes a convenient mnemonic. In fact if this 
is not done a strong argument can be made that the use of a 
pre-defined pattern is too obscuring to be used as a general 
programming practice. 


To keep labels from conflicting we will employ the usual prac- 
tice of appending an identifying suffix to some convenient 
root. Thus, for function ALPHA, we can use labels ALPHA_1,
ALPHA_2, etc. Labels such as LOOP or DONE are obviously poor
practice except for examples or in a main routine, and even
there we always shudder a bit when forced to contemplate them.


We will rely a great deal on the following convention for 
defining functions. The DEFINE function must be executed in 
SNOBOL4 before a function can be defined. For well-structured
programs, the body of the function should be adjacent to the 
function definition. The function body should not be entered 
other than via a function call. Hence we will use a hop-around 
convention. To define the function ALPHA() we write: 


          DEFINE('ALPHA()')
          Initialization for ALPHA
                                             :(ALPHA_END)
ALPHA
          Function body of ALPHA
ALPHA_END


As indicated here, unless we have special reasons for doing
otherwise the entry label will be the same as the name of the
function. Following the call to DEFINE(), we have what is
called the initialization section, in which we may assign
values to variables, initialize tables, etc. The
initialization section is especially helpful in SNOBOL4 since
for efficiency reasons many patterns should be defined
'out-of-line'. The ability to perform initializing computations
on a per-function basis is not generally available in most
programming languages. Hence, the hop-around technique, which
at first appears to be a cumbersome apparatus for overcoming a
language deficiency, becomes a language asset for structuring
one's programs.


Other conventions are as follows. Although the initial value 
of each variable is the null string, we will not generally use 
this fact. Hence, the initialization section is free to modify 
any variable not used globally (i.e., one whose name does not 
contain one of the special characters '.' or '_'). An
exception is the variable NULL whose value is never changed. Of
course any variable which is a temporary variable of a func- 
tion will be automatically assigned the null string before 
function entry and this fact will be used throughout. 


Occasionally a transfer is made to the label ERROR. It is not 
necessarily presumed that a label named ERROR actually appears 
in the source program. If a branch is attempted to some un- 
defined label, the program will halt and an appropriate 
diagnostic will be given. This will indicate where the error 
occurred. It is also helpful in this regard and in general to 
always set &DUMP on (=1) at the start of the program as this
can provide vital clues as to the source of any error. It is 
easy enough to turn the &DUMP off if the program terminates 
normally.

This chapter covers basic conversions of a kind frequently
needed in a computer environment. We are presenting this
material first, not necessarily because it is the easiest but
because it is relatively unsophisticated. That is, the intent
of a program that does a conversion will probably be clear
even if nothing else is. SNOBOL4 is a good language to
represent conversion algorithms because frequently the
objects converted are strings.
This is natural because we are normally converting between two 
external representations of the same thing and the way we 
represent things externally is most often via strings of 
characters. 


Program 2.1  UPLO

UPLO is a program for converting all upper case characters
within a string to lower case and vice versa. Thus
UPLO('UPlo') will return 'upLO'. In all cases, characters
which cannot be converted are left unchanged. The program
assumes the IBM 360 EBCDIC encoding of characters [IBM360a;
Appendix F]. There are many uses for such a program owing to
the relative difficulty of keypunching lower case letters and 
the growing use of printers with lower case graphics. 


*  UPLO(S) will convert upper case to lower case and vice
*  versa. The argument S is an arbitrary string. Nonalphabetic
*  characters are ignored.
*
          DEFINE('UPLO(S)')
*
*  The first problem is to obtain the sequence of lower case
*  letters. This is done by a computation to avoid having to
*  type lower case letters in the program itself. The
*  computation depends on the fact that the upper case letters
*  and the lower case letters are arranged in an identical
*  pattern on the EBCDIC chart. The only difference is that
*  the lower case letters are in the 3rd quadrant (Q3) of
*  &ALPHABET and the uppers are in the 4th quadrant (Q4).
*
          &ALPHABET  LEN(128)  LEN(64) . Q3  LEN(64) . Q4
          UPPERS_  =  'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
          LOWERS_  =  REPLACE(UPPERS_, Q4, Q3)
          UP_LO    =  UPPERS_ LOWERS_
          LO_UP    =  LOWERS_ UPPERS_
                                                  :(UPLO_END)
*
*  Then the function UPLO merely consists of a call to the
*  REPLACE function.
*
UPLO      UPLO  =  REPLACE(S, UP_LO, LO_UP)       :(RETURN)
UPLO_END

Epilogue


As discussed in chapter one, we will generally begin a func- 
tion with a call to DEFINE. Following this is the initializa- 
tion section. Here we initialize variables such as UP_LO so 
that subsequent execution is fast. After initialization a 
transfer around the function body is made to a label which is 
normally the function name followed by '_END' (UPLO_END in our
example). When the function is called, execution normally 
begins at the statement labeled with the same name as the name 
of the function (UPLO in this example). 


The encoding of UPLO depends on the arrangement of characters 
in the string &ALPHABET. The characters shown in the box below
are the result of printing &ALPHABET on the printer used to
produce this book. 


          [Figure: the 256 characters of &ALPHABET as printed,
          one sector per quadrant, two lines of 32 characters
          per sector. The lower case letters appear in the
          third quadrant; the upper case letters and the digits
          appear in the fourth.]

In EBCDIC, &ALPHABET contains 256 characters which may be 
regarded as consisting of four quadrants of 64 characters 
each. In the above, each quadrant is printed in a separate 
sector as two lines of 32 characters each. It is easy to see 
from this table that the relative positions of the upper and 
lower case alphabets in their respective quadrants are the
same. Hence it is possible to obtain the lower case alphabet 
from the upper case by a simple replacement. 


Although UPLO is character-code dependent, it can easily be 
modified for ASCII [ASCII]. In this case, &ALPHABET contains 
128 characters whose printing graphics are shown (in order) 
below. 


           !"#$%&'()*+,-./0123456789:;<=>?
          @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
          `abcdefghijklmnopqrstuvwxyz{|}~

UPLO can be modified to operate with such an &ALPHABET by
changing five numbers.
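
For readers who want to check the ASCII variant on a modern system, the following is a minimal sketch in Python rather than SNOBOL4. It assumes a 128-character alphabet in which the upper case letters fall in the third block of 32 characters and the lower case letters in the fourth; the names ALPHABET, Q3, Q4 and uplo are illustrative and are not part of Program 2.1.

```python
# Hypothetical Python analogue of UPLO for a 128-character ASCII alphabet.
ALPHABET = "".join(chr(i) for i in range(128))

# Split off the third and fourth blocks of 32 characters.
Q3 = ALPHABET[64:96]     # contains the upper case letters
Q4 = ALPHABET[96:128]    # contains the lower case letters

UPPERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# Obtain the lower case letters by replacement, as the initialization
# section of UPLO does, instead of typing them in.
LOWERS = UPPERS.translate(str.maketrans(Q3, Q4))

UP_LO = UPPERS + LOWERS
LO_UP = LOWERS + UPPERS
_TABLE = str.maketrans(UP_LO, LO_UP)


def uplo(s: str) -> str:
    """Swap the case of every letter; leave all other characters alone."""
    return s.translate(_TABLE)


print(uplo("UPlo"))
```

The translation table is built once, outside the function, which mirrors the role of the initialization section: the call itself is a single table-driven replacement.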


Program 2.2  BCD_EBCDIC

The transition to the 3rd generation brought with it, for IBM
users, a character conversion problem. The old 6-bit BCD code
was replaced by an expanded 8-bit code. One disadvantage of
the older code was that business and scientific users had
different graphics for the same card code. In particular, the
5 characters #@%<& known only to the business users had the
same card code respectively as ='()+ which were known only to
the scientific user. These
two sets diverged in the 3rd generation. The fortunate busi- 
ness users saw no change, but the scientific user (such as the 
FORTRAN programmer) suddenly found lots of strange characters 
in his source program. 


In such cases one would like to write a program to convert an 
input deck with these 5 commercial characters into the scien-
tific equivalents. One such program is Program 2.2; it appears
on one line, and in the days when we were converting to 3rd
generation, I found it convenient to carry such a card on my 
person as a ready answer for anyone wishing to know the 
whereabouts of a program for translating BCD to EBCDIC. 


*  This is a complete program to convert BCD card code to
*  EBCDIC card code. Input cards will be read in, converted,
*  and punched. When no more cards remain the program
*  terminates.
*
L         PUNCH = REPLACE(INPUT, "#@%<&", "='()+")     :S(L)
END

Epilogue


This is a neat and compact example of the use of the REPLACE 
function. A card is read in and any character of the second 
argument found in this card is replaced by the corresponding 
character in the 3rd argument. The REPLACE function is fast, 
proceeding at machine speeds (on the IBM 360-370 a 256-byte
table is set up, after which a single instruction (TR) trans-
lates the entire string [IBM360a]). The REPLACE function is
not only extremely useful for such transliterations but, as we
shall see in the next chapter, can be used for permuting and 
rearranging characters as well. 
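
The transliteration performed by REPLACE can be illustrated with a table-driven translation in Python. This is only a sketch: the pairing of commercial and scientific graphics is the one given in the text above, and the names COMMERCIAL, SCIENTIFIC and bcd_to_ebcdic are hypothetical.

```python
# Commercial graphics and their scientific equivalents, position by position.
COMMERCIAL = "#@%<&"
SCIENTIFIC = "='()+"

# Build the translation table once; each card is then a single lookup pass.
_TABLE = str.maketrans(COMMERCIAL, SCIENTIFIC)


def bcd_to_ebcdic(card: str) -> str:
    """Replace each commercial graphic by the corresponding scientific one."""
    return card.translate(_TABLE)


# A FORTRAN statement as it might look when printed with commercial graphics.
print(bcd_to_ebcdic("X # A%I<&1"))
```

As with REPLACE, characters that do not appear in the first string pass through unchanged.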


Program 2.3  ROMAN

ROMAN will convert its argument, assumed to be an integer,
into Roman numeral format. Thus, ROMAN(256) returns 'CCLVI'.
Though a classic problem in string manipulation, the reader
may wonder about the utility of such a program (are we going
to use SNOBOL4 to print tombstones?). But there is one
common application in which such an algorithm is essential,
viz. a text formatter which must number pages preceding the 
first with Roman numerals. In such cases it is customary to 
perform computations (such as adding one for each page) in the 
normal Arabic system before converting. In this example, the 
Roman numeral would normally appear in lower case. This con- 
version, if necessary, can be done using UPLO, Program 2.1. 


Although it occasionally happens that we wish to convert from 
Arabic to Roman we almost never want to do the reverse so that 
we will be content here with going in one direction only. 


*  ROMAN(N) will return a string equal to the Roman numeral
*  equivalent of the integer N. N is assumed to be less than
*  4000 and nonnegative.
*
          DEFINE('ROMAN(N)T')                     :(ROMAN_END)
*
*  Entry point: remove the last digit and call it T.
*
ROMAN     N  RPOS(1)  LEN(1) . T  =               :F(RETURN)
*
*  Convert T to its equivalent Roman form. Then append it to
*  the Romanized form of the preceding digits multiplied by
*  10.
*
          '0,1I,2II,3III,4IV,5V,6VI,7VII,8VIII,9IX,'
+              T  BREAK(',') . T                  :F(FRETURN)
          ROMAN  =  REPLACE(ROMAN(N), 'IVXLCDM', 'XLCDM**')  T
+                                                 :S(RETURN)F(FRETURN)
ROMAN_END

Epilogue 


The big trick here is to realize that it is relatively easy to 
multiply a Roman number by 10 by merely doing a translitera- 
tion of its symbols into the next higher 'octave'. This is 
done by REPLACE. Another trick which reduces the size of the
program is to compact a set of information into a long string
and use SNOBOL4's powerful pattern matching to extract the
information. 
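
The 'multiply by 10' trick can be checked with a short recursive sketch in Python. It follows the same two steps as Program 2.3 (strip the last digit, then transliterate the Roman form of what remains), but the dictionary UNITS and the names TIMES_TEN and roman are illustrative assumptions, and the argument is taken as a string of decimal digits.

```python
# Roman form of a single decimal digit (the lowest 'octave').
UNITS = {"0": "", "1": "I", "2": "II", "3": "III", "4": "IV",
         "5": "V", "6": "VI", "7": "VII", "8": "VIII", "9": "IX"}

# Transliteration that multiplies a Roman numeral by 10. M and D have
# no successors here, so they map to '*' as in the REPLACE call.
TIMES_TEN = str.maketrans("IVXLCDM", "XLCDM**")


def roman(n: str) -> str:
    """Convert a string of decimal digits (value below 4000) to Roman."""
    if n == "":
        return ""
    head, last = n[:-1], n[-1]
    # Romanize the leading digits, shift them up one octave, append the units.
    return roman(head).translate(TIMES_TEN) + UNITS[last]


print(roman("256"))
```

For 256 the recursion produces 'II', then 'XX' + 'V', then 'CCL' + 'VI'.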


This is not the fastest encoding of ROMAN. There was no effort 
to economize on time because it may be presumed that the use 
of ROMAN is infrequent. If anything, an effort was made to
reduce the size of the program in order to minimize storage
consumption. This is good practice for seldom-used code.

Programs 2.4 & 2.5  BASEB & BASE10

The decimal system in common use to represent numbers is a
positional system, meaning that the value of a digit depends
on its position. Generally, in a positional number system,
the numeral

          $a_1 a_2 \cdots a_n$

represents the number

          $a_1 B^{n-1} + a_2 B^{n-2} + \cdots + a_n$



where B is some integer called the base. The decimal system 
uses B = 10. A positional system can represent arbitrarily 
large quantities with only a finite number (equal to B) of 
symbols. This is in contrast to the Roman numbers where the 
value of a symbol depends on the symbol itself and not on its 
position. Hence, for arbitrarily large numbers, we need ar-
bitrarily many symbols.


Though our current decimal system was introduced in Europe by 
the Arabs in the 9th Century, the system did not flourish 
there until the 16th Century Spanish merchants were humiliated 
by the arithmetic prowess of the stone-age Mayan Indians who 
were using a base 20 positional system. See Von Hagen [1960].


The growth of computer systems in which base 2 arithmetic is 
used internally to represent numeric quantities has drawn at- 
tention to the representation of numbers in various bases and 
has led to the need in many cases to convert from one base to 
another. 


In this section we include two routines for base conversion. 
BASEB(N,B) will convert integer N into its representation in 
base B. Thus, BASEB(15,3) will return '120' as this is the 
base 3 representation of 15. Conversely, BASE10(N,B) will con-
vert the numeral N in base B to the equivalent decimal number.
Thus BASE10('120',3) will return '15'. This is customarily
written


          $(120)_3 = 15$


where the absence of an explicit base indication implies base 
10. 


To convert N from base b1 to base b2 we could combine the
functions thusly:


          BASEB(BASE10(N, b1), b2)


The characters used to indicate digits higher than 9 are the 
letters of the alphabet with A equal to 10, B equal to 11, 
etc. This seems to be the most common method of denoting the 
higher digits. On the other hand, there are dissenters who
say that this encoding is unnatural in that the even letters
(B, D, F, etc.) correspond to odd numbers (11, 13, 15, ...)
whereas the odd letters (A, C, E, ...) correspond to even num-
bers (10, 12, 14, ...). These people might prefer the letters
'XABC...' rather than 'ABC...'. Another method might be to use
some arbitrary sequence from the end of the alphabet such as
'UVWXYZ' rather than 'ABCDEF'. In either case, the functions
BASEB and BASE10 can be modified to suit by changing the value
of the global variable BASEB_ALPHA.


*  BASEB(N,B) will convert the integer N to its base B
*  representation. B may be any positive integer <= 36.
*
          DEFINE('BASEB(N,B)R,C')
          BASEB_ALPHA  =  '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                                                  :(BASEB_END)
*
*  Entry point and top of loop: If N is zero we are done.
*
BASEB     EQ(N,0)                                 :S(RETURN)
*
*  Obtain the base-B representation (C) of the least
*  significant digit of N.
*
          R  =  REMDR(N,B)
          BASEB_ALPHA  TAB(*R)  LEN(1) . C        :F(ERROR)
*
*  Tack result onto previous value, update N and loop.
*
          BASEB  =  C BASEB
          N  =  N / B                             :(BASEB)
BASEB_END
*
*  BASE10(N,B) will convert the string N assumed to be a
*  numeral expressed in base B arithmetic to decimal (base
*  10).
*
          DEFINE('BASE10(N,B)T')
          BASEB_ALPHA  =  '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                                                  :(BASE10_END)
*
*  Entry point and top of loop. Find first digit in N and
*  determine its value in base 10.
*
BASE10    N  LEN(1) . T  =                        :F(RETURN)
          BASEB_ALPHA  BREAK(*T)  @T              :F(ERROR)
*
*  Then use standard conversion algorithm for converting to
*  base 10.
*
          BASE10  =  (BASE10 * B) + T             :(BASE10)
BASE10_END

Epilogue


In BASEB, the search for the representation of the Rth charac- 
ter is done using the pattern 


          TAB(*R)  LEN(1) . C

This pattern is identical in performance to the pattern

          TAB(R)  LEN(1) . C


Strangely enough, the former is faster in SPITBOL. This is 
because TAB(*R) LEN(1) . C is a constant valued pattern and 
can be pre-evaluated, whereas the same pattern without the '*'
is not constant. It requires more time, in general, to form 
the pattern than it does to do the pattern match so that much 
has been gained. A similar remark can be made about the pat- 
tern matching statement involving BREAK(*T) immediately fol- 
lowing label BASE10. 


In SNOBOL4, similar considerations apply except that the
programmer must pre-evaluate his own expressions; the compiler
will not do it for him. Thus


          CONVERT_R  =  TAB(*R)  LEN(1) . C

          BASEB_ALPHA  CONVERT_R


would yield a more efficient rendition, in SNOBOL4, of the
function BASEB. This is recommended if speed is of importance.
The pattern CONVERT_R could be defined in the initialization
section of the function thereby keeping the pattern associated
with the function. But note that


          CONVERT_R  =  TAB(R)  LEN(1) . C

          BASEB_ALPHA  CONVERT_R


would not be valid because the pattern CONVERT_R would be
using the value of R at the time of assignment and not at the
time of the pattern match.


We will not always use a deferred form such as TAB(*R) but 
will generally prefer TAB(R). This is simpler and is not im- 
plementation dependent. It is always easy enough to modify 
the function so that a pattern is not continually being 
generated. Choosing the path of least resistance, as we will 
tend to do, has another advantage. For those programs for 
which space is more important than time, pre-defining the pat- 
tern is actually less efficient for the pattern must then 
occupy space continuously and not merely when it is needed. 
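
The two conversion loops of BASEB and BASE10 can be cross-checked with the following Python sketch. It assumes a nonnegative integer N and a base between 2 and 36, and it uses plain string indexing where the SNOBOL4 versions use TAB and BREAK with cursor assignment; the function names baseb and base10 are illustrative.

```python
BASEB_ALPHA = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def baseb(n: int, b: int) -> str:
    """Base-b numeral for n. Like BASEB, zero yields the null string."""
    result = ""
    while n != 0:
        r = n % b                          # least significant digit
        result = BASEB_ALPHA[r] + result   # tack it onto the front
        n //= b                            # integer division, then loop
    return result


def base10(numeral: str, b: int) -> int:
    """Value of a base-b numeral, one digit at a time from the left."""
    value = 0
    for ch in numeral:
        # The position of the digit in the alphabet is its value.
        value = value * b + BASEB_ALPHA.index(ch)
    return value


print(baseb(15, 3), base10("120", 3))
print(baseb(base10("FF", 16), 2))
```

The last line shows the composition of the two functions, here converting from base 16 to base 2.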


Program 2.6  HEX

To a human being a character is some geometric configuration,
but to a machine it is just a sequence of bits. On the IBM
360-370 series machines, a character is a sequence of 8 bits.
For example, the pattern of bits representing the letter A is


          11000001


It is obviously more convenient to write these 8 bits in base
16 notation so that A comes out looking like


          C1


HEX(S) is a function which will accept a string of characters
and return a string of hexadecimal digits representing its in-
ternal representation. Thus


          HEX('ABA')

returns 'C1C2C1'.


All characters have an 8-bit code and all 8-bit codes 
represent some character, but not all characters are prin- 
table. Thus the SNOBOL4 keyword &ALPHABET is a string of all 
the 8-bit characters starting with 00000000 and going on up to 
11111111 (in numerical order). If this string were to be 
printed (as we did earlier) most of the characters would ap- 
pear blank. The graphical image printed is a function of the 
printer. The IBM 1403 printer has room for at most 240 
graphics. Moreover, to increase printing speed there are many 
duplications of the more frequently appearing characters. The 
net result is that there are seldom more than 100 graphics in 
&ALPHABET. Thus, an important use of HEX is for processing
data which is not character oriented and is therefore not 
easily dealt with in terms of characters. For example, suppose 
we wish to scan the input text for 2 consecutive occurrences 
of the hexadecimal constant 50. Then the following statement 
would perform the scan 


          HEX(INPUT)  POS(0)  ARBNO(LEN(2))  '5050'


*  HEX(S) will return the hexadecimal (internal) representa-
*  tion of the string S.
*
          DEFINE('HEX(S)')
*
*  Prepare tables of the 1st and 2nd hex digits.
*
          H  =  '0123456789ABCDEF'
          HEX_2ND  =  DUPL(H,16)
HEX_1     H  LEN(1) . T  =                        :F(HEX_END)
          HEX_1ST  =  HEX_1ST DUPL(T,16)          :(HEX_1)
*
*  Entry point: Form the first and second digits separately
*  and then blend them.
*
HEX       HEX  =  BLEND(REPLACE(S, &ALPHABET, HEX_1ST),
+                       REPLACE(S, &ALPHABET, HEX_2ND))   :(RETURN)
HEX_END

Names referenced     Name     Type       Where defined
by HEX:              BLEND    Function   Program 3.7

Epilogue 

We have taken an unusual approach in encoding HEX. It might 


seem at first that it would be better to prepare some table 
which would yield the correct pair of characters for every 
character in the &ALPHABET. But we have already noted how fast 
REPLACE can be so that we can obtain either hex digit ex- 
tremely quickly. The question remains as to how we may swiftly 
merge the 2 character sequences. This we do by the program 
BLEND (Program 3.7) which merges 2 equi-length strings. As we
shall see, BLEND also uses the REPLACE function in an unob- 
vious way and is quite rapid. 
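
The two-table approach can be sketched in Python as follows. The tables HEX_1ST and HEX_2ND are built as in the initialization section of HEX, but the blend function here is a plain interleave written for this sketch and is not Program 3.7; Python's cp037 codec is assumed as a stand-in for the EBCDIC encoding.

```python
H = b"0123456789ABCDEF"

# Table of first hex digits: each digit repeated 16 times in a row.
HEX_1ST = bytes(H[i // 16] for i in range(256))
# Table of second hex digits: the 16 digits repeated 16 times over.
HEX_2ND = H * 16


def blend(a: bytes, b: bytes) -> bytes:
    """Interleave two equal-length byte strings: a0 b0 a1 b1 ..."""
    return bytes(x for pair in zip(a, b) for x in pair)


def hex_of(s: bytes) -> str:
    """Hexadecimal representation of s, two digits per byte."""
    first = s.translate(HEX_1ST)    # one translation per digit position
    second = s.translate(HEX_2ND)
    return blend(first, second).decode("ascii")


# 'ABA' in EBCDIC (code page 037) is C1 C2 C1.
print(hex_of("ABA".encode("cp037")))
```

Each translation is a single pass over the string against a 256-entry table, which is the property the epilogue relies on for speed.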


Program 2.7  CH

CH(H) will take a string of hexadecimal digits (H) and convert
them to the corresponding character sequence. Thus
CH('C1C2') will return 'AB'. CH is the inverse of HEX so that
CH(HEX(S)) = S. The conversion provided by CH can be useful
for obtaining characters that can be printed but not typed.
Thus CH('818283') returns 'abc'.


*  CH(HEX) will convert the sequence of hexadecimal digits
*  into the corresponding character string. CH is the inverse
*  of HEX.
*
          DEFINE('CH(HEX)T,C,N')
                                                  :(CH_END)
*
*  Entry point: Remove 2 characters from string HEX. Then
*  convert to decimal (using BASE10) and retrieve the indexed
*  character from the &ALPHABET.
*
CH        HEX  LEN(2) . T  =                      :F(RETURN)
          C  =  BASE10(T,16)
          &ALPHABET  LEN(C)  LEN(1) . C
          CH  =  CH C                             :(CH)
CH_END

Names referenced     Name     Type       Where defined
by CH:               BASE10   Function   Program 2.5

Epilogue


The method used to program CH is to treat each pair of hex- 
adecimal characters as a number in base 16. This number can 
be converted to decimal using BASE10 (Program 2.5). This 
decimal number can then be used to index into the keyword
&ALPHABET.
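
The same pair-at-a-time loop can be sketched in Python. The sketch uses the built-in int(pair, 16) in place of BASE10 and collects byte values directly instead of indexing into an alphabet string; the name ch and the use of the cp037 codec to display EBCDIC graphics are assumptions of the sketch.

```python
def ch(hex_digits: str) -> bytes:
    """Convert pairs of hexadecimal digits to the bytes they denote."""
    out = bytearray()
    # Remove two characters at a time; stop when fewer than two remain.
    while len(hex_digits) >= 2:
        pair, hex_digits = hex_digits[:2], hex_digits[2:]
        out.append(int(pair, 16))   # value of the pair taken in base 16
    return bytes(out)


# Decode with the EBCDIC code page to see the characters.
print(ch("C1C2").decode("cp037"))
print(ch("818283").decode("cp037"))
```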


Program 2.8  DAY

DAY will return the day of the week given some date. Thus
DAY('3/24/71') will return 'WEDNESDAY', and DAY(DATE()) will
return the current day. As an added bonus, the global
variable D will be set to an integer between 0 and 6 inclusive
to give a numeric indication of the day. If a year other than
one from the 20th century is intended then a 4-digit year must
be given as in DAY('3/24/1825'). If the year is missing, the
current year is assumed. Thus:


          'CHRISTMAS FALLS ON ' DAY('12/25') ' THIS YEAR.'


will be a semantically correct string when evaluated, no matter
in what year it is evaluated. 


The program assumes the Gregorian Calendar and will accept
any date from the 2nd century onward (i.e. after 100
A.D.). The extrapolation into the time period before the
Gregorian calendar went into effect (1582), however, will not
agree with historical records. 


It is interesting to note that the revision of the calendar 
followed on the heels of the discoveries of Indian civiliza- 
tions in the New World whose elaborate and involved calendrics 
are said to be even more accurate than our present Gregorian 
calendar (see Morley [1956] for example). 


*  DAY(DATE) will return the day of the week appropriate to
*  the given DATE. DATE is given as month/day/year.
*
          DEFINE('DAY(DATE)M,Y')
*
*  YEAR is the number of days in a year. YEAR_4, CENT and
*  CENT_4 are the number of days in the cyclic time periods
*  of respectively 4 years, a century and 4 centuries.
*
          YEAR      =  365
          YEAR_4    =  4 * YEAR + 1
          CENT      =  (25 * YEAR_4) - 1
          CENT_4    =  4 * CENT + 1
          DAY_ZERO  =  2
                                                  :(DAY_END)
*
*  First extract the month, day, and year. If the year is
*  null the current year (obtained from DATE) is used. Then
*  '19' is prepended if the year is only 2 characters long.
*
DAY       DATE  BREAK('/') . M  LEN(1)
+              (BREAK('/') . D  LEN(1)  REM . Y  |  REM . D)
          (IDENT(Y,'') DATE())  '/'  ARB  '/'  REM . Y
          Y  =  EQ(SIZE(Y),2)  '19' Y
*
*  The number of days since March 0, 0000 will be computed.
*  First compute the number of whole months and the number of
*  whole years since that date.
*
          M  =  LE(M,2)  M + 12                   :F(DAY_1)
          Y  =  Y - 1
DAY_1     M  =  M - 3
*
*  Now add an appropriate number of days for each cyclic year
*  period. Note: integer divided by integer yields integer.
*
DAY_2     DAY  =  (Y / 400) * CENT_4 + (REMDR(Y,400) / 100) * CENT
+              +  (REMDR(Y,100) / 4) * YEAR_4 + REMDR(Y,4) * YEAR
*
*  Now add an appropriate amount for the month (note that 153
*  is the number of days in a 5-month period), the day, and
*  an initializing constant. This value is taken modulo 7
*  and a search is made based on that value.
*
          DAY  =  DAY + ((153 * M) + 2) / 5 + D + DAY_ZERO
          D  =  REMDR(DAY,7)
          '0SUN1MON2TUES3WEDNES4THURS5FRI6SATUR7'
+              D  BREAK('01234567') . DAY
          DAY  =  DAY 'DAY'                       :(RETURN)
DAY_END

Epilogue


This program was modified for SNOBOL4 from an Algol program by
Tantzen [1963]. His version is slightly more efficient and we 
leave this refinement as an exercise. 


The program is done by a computation; it could also have been 
done by a look-up procedure in which a string might contain a 
month-day sequence in which the proper number of days are as- 
sociated with each month. In general, this would have been 
easier and less error-prone but would not have been as 
efficient. 


A very clever scheme is used to obtain the number of days that 
a given month is worth. It is recognized that if we start in 
March, the number of days per month is given by the sequence 
31 30 31 30 31 which repeats itself for effectively the 
remainder of the March - March year. The computation: 


          $$\frac{153 \cdot M + 2}{5}$$


is so calculated as to yield precisely the correct number of
days.
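
The month formula and the overall day count can be checked with the following Python sketch. It transcribes the arithmetic of Program 2.8 using Python's integer division; the function names month_offset and day, and the choice to pass month, day and a 4-digit year as separate integers, are assumptions made for the sketch.

```python
YEAR = 365
YEAR_4 = 4 * YEAR + 1         # days in a 4-year period
CENT = 25 * YEAR_4 - 1        # days in a century
CENT_4 = 4 * CENT + 1         # days in 4 centuries
DAY_ZERO = 2                  # initializing constant from Program 2.8

NAMES = ["SUN", "MON", "TUES", "WEDNES", "THURS", "FRI", "SATUR"]


def month_offset(m: int) -> int:
    """Days preceding month m of the March year (m = 0 is March)."""
    return (153 * m + 2) // 5


def day(month: int, dom: int, year: int):
    """Return (name of the weekday, number 0..6 with Sunday = 0)."""
    m, y = month, year
    if m <= 2:                # January and February close the previous
        m += 12               # March-to-March year
        y -= 1
    m -= 3
    days = ((y // 400) * CENT_4 + ((y % 400) // 100) * CENT
            + ((y % 100) // 4) * YEAR_4 + (y % 4) * YEAR)
    days += month_offset(m) + dom + DAY_ZERO
    d = days % 7
    return NAMES[d] + "DAY", d


# Month lengths implied by the formula, March through January.
print([month_offset(m + 1) - month_offset(m) for m in range(11)])
print(day(3, 24, 1971))
```

The first line printed is the sequence of month lengths starting in March, which is the property the formula was designed to have.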


Program 2.9  MDY

MDY(Y,D) will convert a year,day date into a month/day/year
date. For example MDY(71,83) will return '3/24/71'. The global
variables M and D are set to equal the month and day
respectively. MDY is useful in an environment where the system
computes days but not months (such as OS 360).


*  MDY(Y,D) will convert its argument which is given as year
*  , day into month/day/year format.
*
          DEFINE('MDY(Y,DY)X,T')
*
*  Set up 2 tables to be searched. One showing cumulative
*  days vs. month (DAY_MONTH) for normal years and one for
*  leap years (LY_DAY_MONTH).
*
          DAY_MONTH  =  '(334,12)(304,11)(273,10)(243,9)'
+              '(212,8)(181,7)(151,6)(120,5)(90,4)(59,3)(31,2)(0,1)'
          LY_DAY_MONTH  =  '(335,12)(305,11)(274,10)(244,9)'
+              '(213,8)(182,7)(152,6)(121,5)(91,4)(60,3)(31,2)(0,1)'
*
*  Set up a pattern to search the tables.
*
          I  =  SPAN('0123456789')
          SEARCH.X.M  =  '('  I $ X  *GT(DY,X)  ','  I $ M   :(MDY_END)
*
*  Entry point: Set up the proper table in T. Use leap year
*  table if Y is either (divisible by 400) or (divisible by 4
*  but not 100).
*
MDY       T  =  EQ(REMDR(Y,400),0)  LY_DAY_MONTH  :S(MDY_1)
          T  =  EQ(REMDR(Y,100),0)  DAY_MONTH     :S(MDY_1)
          T  =  EQ(REMDR(Y,4),0)    LY_DAY_MONTH  :S(MDY_1)
          T  =  DAY_MONTH
*
*  Then search the table for the current month (M) and the
*  number of days (X) associated with that month. Fail if DY
*  is not a valid day.
*
MDY_1     T  SEARCH.X.M                           :F(FRETURN)
          D  =  DY - X
          GT(D,31)                                :S(FRETURN)
          MDY  =  M '/' D '/' Y                   :(RETURN)
MDY_END

Epilogue


We have written this program in terms of a 'table-look-up'
procedure (actually string look-up would be more correct). But
we could have done this by computational methods by turning
the DAY function around and 'pointing it backward'. This we
invite the reader to try as an Exercise.
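
The table search can be sketched in Python with the two tables held as lists of (cumulative days, month) pairs in the same descending order. The list form, the helper is_leap and the use of None to signal failure are choices made for this sketch; the SNOBOL4 program searches a string and fails with FRETURN.

```python
# Cumulative days before each month, highest month first.
DAY_MONTH = [(334, 12), (304, 11), (273, 10), (243, 9), (212, 8),
             (181, 7), (151, 6), (120, 5), (90, 4), (59, 3), (31, 2), (0, 1)]
LY_DAY_MONTH = [(335, 12), (305, 11), (274, 10), (244, 9), (213, 8),
                (182, 7), (152, 6), (121, 5), (91, 4), (60, 3), (31, 2), (0, 1)]


def is_leap(y: int) -> bool:
    """Divisible by 400, or divisible by 4 but not by 100."""
    return y % 400 == 0 or (y % 4 == 0 and y % 100 != 0)


def mdy(y: int, dy: int):
    """Convert (year, day of year) to 'month/day/year', or None if invalid."""
    table = LY_DAY_MONTH if is_leap(y) else DAY_MONTH
    for x, m in table:
        if dy > x:              # first entry whose offset lies below dy
            d = dy - x
            if d > 31:          # day number runs past the end of December
                return None
            return f"{m}/{d}/{y}"
    return None                 # dy was zero or negative


print(mdy(71, 83))
```

Because the tables run from December down to January, the first entry whose offset is smaller than the day number identifies the month.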


Program 2.10  SPELL

SPELL(N) will return an English phrase designating the
integer N. Thus SPELL(13) will return 'THIRTEEN'. SPELL will
convert all integers from 0 to 999999999 (a thousand million
- 1). SPELL can easily be extended to handle larger ranges;
see Exercise 2.16. One obvious application of SPELL is in
writing checks.


          DEFINE('SPELL(N)M')                     :(SPELL_END)
*
*  Entry Point: Fan out to one of several labels depending
*  on the value of N.
*
SPELL     GE(N,1000)                              :S(SPELL_1000)
          GE(N,100)                               :S(SPELL_100)
          GE(N,20)                                :S(SPELL_20)
          GE(N,13)                                :S(SPELL_13)
*
*  Here if N is 12 or less; look its value up in a table.
*
          ('1ONE,2TWO,3THREE,4FOUR,5FIVE,6SIX,7SEVEN,8EIGHT,9NINE,'
+         '10TEN,11ELEVEN,12TWELVE,')  N  ARB . SPELL  ','   :(RETURN)
*
*  Here to do the teens. It will be simpler to do the tens
*  version and substitute 'TEEN' for 'TY' afterward.
*
SPELL_13  N  1  LEN(1) . M
          SPELL  =  SPELL(M 0)
          SPELL  'TY'  =  'TEEN'
          SPELL  'FOR'  =  'FOUR'                 :(RETURN)
*
*  Here to handle all compounds from 20 through 99. Just look
*  up the root in a table and add the suffix 'TY'. Then call
*  SPELL recursively to handle the units.
*
SPELL_20  N  LEN(1) . M  =
          '2TWEN,3THIR,4FOR,5FIF,6SIX,7SEVEN,8EIGH,9NINE,'
+              M  BREAK(',') . SPELL
          SPELL  =  SPELL 'TY'
          SPELL  =  NE(N,0)  SPELL '-' SPELL(N)   :(RETURN)
*
*  Hundreds are handled by converting the hundreds and tens
*  recursively.
*
SPELL_100 N  LEN(1) . M  =
          SPELL  =  SPELL(M) ' HUNDRED'
          SPELL  =  NE(N,0)  SPELL ' AND ' SPELL(N)    :(RETURN)
*
*  For numbers over 1000, remove all but the last three
*  digits of N assigning them to M. Convert M, 'multiply' it
*  by 1000 and 'add' N.
*
SPELL_1000
          N  RTAB(3) . M  =
          SPELL  =  SPELL(M)
          SPELL  ' THOUSAND'  =  ' MILLION'
          SPELL  =  SPELL ' THOUSAND'
          SPELL  =  NE(N,0)  SPELL ' AND ' SPELL(N)    :(RETURN)
SPELL_END
Epilogue 


SPELL was written to be small rather than fast and uses recur- 
sion quite liberally and effectively to render a smaller and 
more readable program. 
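
The recursive structure for the teens, tens and hundreds can be tried out with the Python sketch below. It is deliberately limited to integers from 1 through 999 and works on integers rather than digit strings, so it is not a transcription of Program 2.10; the names UNITS, TENS and spell are illustrative.

```python
UNITS = ["", "ONE", "TWO", "THREE", "FOUR", "FIVE", "SIX", "SEVEN",
         "EIGHT", "NINE", "TEN", "ELEVEN", "TWELVE"]
TENS = {2: "TWEN", 3: "THIR", 4: "FOR", 5: "FIF",
        6: "SIX", 7: "SEVEN", 8: "EIGH", 9: "NINE"}


def spell(n: int) -> str:
    """English phrase for an integer from 1 through 999."""
    if not 1 <= n <= 999:
        raise ValueError("this sketch handles 1 through 999 only")
    if n >= 100:
        hundreds, rest = divmod(n, 100)
        text = spell(hundreds) + " HUNDRED"
        return text + " AND " + spell(rest) if rest else text
    if n >= 20:
        tens, units = divmod(n, 10)
        text = TENS[tens] + "TY"
        return text + "-" + spell(units) if units else text
    if n >= 13:
        # Teens: spell the tens form, then substitute TEEN for TY
        # and repair FOR to FOUR, as Program 2.10 does.
        text = spell((n - 10) * 10).replace("TY", "TEEN")
        return text.replace("FOR", "FOUR")
    return UNITS[n]


print(spell(13), spell(256))
```

The teens branch shows how reusing the tens table keeps the program small: 14 becomes 'FORTY', then 'FORTEEN', then 'FOURTEEN'.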


Exercise 2.1

Using strings prepared in the initialization section of UPLO
write a function UP() which will convert any lower case in its
argument to upper case.


Exercise 2.2

Given the function UPLO() and a function UP() which converts
lower case to upper case, write a function LO() which converts
upper case to lower case.


Exercise 2.3

Given a paragraph in P assumed keypunched in
upper case, use UPLO to convert P into lower
case except that the first character of every sentence should 
remain capitalized. The first nonblank character is regarded 
as the beginning of the first sentence. Subsequent sentences 
are marked by a period followed by at least 2 blanks. (This 
requires only two statements.) 


Exercise 2.4

Write a function (ARABIC) to convert a number in the Roman
representation to one in standard (base 10) notation.


Exercise 2.5

Let $\lceil x \rceil$ be the smallest integer $\ge$ the real
number x (sometimes referred to as the ceiling of x). Thus


          $\lceil 1.5 \rceil = 2$
          $\lceil 2.0 \rceil = 2$
          $\lceil -9.5 \rceil = -9$

With the help of functions defined in this section write
SNOBOL4 expressions equivalent to


          $\lceil \log_2 K \rceil$
          $\lceil \log_n K \rceil$

where K and n are positive integers.


Exercise 2.6

The Mayan Indians used a base 20 positional number system.
The figures for the digits 0 thru 19 were built up
systematically as in the table below.


          [Table of Mayan digit forms and their decimal
          equivalents not reproduced.]


Hence the number 752 would be represented as


          [Figure not reproduced.]


Here the digits are run from left to right in descending
significance whereas the Mayans would align their digits
vertically. Also the dots ran in a direction orthogonal to the
bars. One has a great deal more freedom in these matters if 
one is merely carving the figures out of stone. 


The exercise is, given the integer N write a loop to convert N 
to its Mayan form. This can be done in 4 statements (without 
using the functions defined in this chapter). 


Exercise 2.7

A hypothetical machine has a word size of 32 bits represented
as $b_1 b_2 \cdots b_{32}$. The bits have the following
meaning when representing floating point.

S: $b_1$ (sign) 0:positive, 1:negative

E: $(b_2 \cdots b_{11})$ exponent of 2 in excess 1024 notation

F: $(b_{12} \cdots b_{32})$ fractional part with decimal point
to the left of $b_{12}$.


Hence a floating point number will have the value:


          $$(-1)^S \cdot \frac{F}{2^{21}} \cdot 2^{(E-1024)}$$


Write a function (using the base conversion algorithms) to 
convert an eight-hexadecimal-digit machine word into a 
floating point number. 


Exercise 2.8

Extend the routines BASEB and BASE10 to handle decimal
points. Assume a global cell
PRECISION which will hold the number of digits of precision 
required in the fraction. Allow BASEB and BASE10 to call 
themselves recursively. 


Exercise 2.9   What statements would have to be modified if
BASEB and BASE10 were to be extended to unlimited-precision
arithmetic?


Exercise 2.10   Let Y, N and M be integers.

a) Show that:

     REMDR(Y, N*M) / N  =  (Y / N) - (Y / (M*N)) * M

and hence that the line labeled DAY_2 in Program 2.8 can be
rewritten:

DAY_2    DAY  =  (Y / 400) * K1 + (Y / 100) * K2
+                + (Y / 4) * K3 + Y * K4

where K1, K2, K3, K4 are values which can be precomputed.

b) Compute K1, K2, K3, K4.


Exercise 2.11   Suppose there are 64 characters in
&ALPHABET. Rewrite HEX so that it returns the base-8
representation of a string. Call the function OCTAL.


Exercise 2.12   In writing a compiler it is sometimes
necessary to manipulate bits since the instruction is formed
as a sequence of bits.

a) Set the Nth bit of a string S to 1. Assume the bits are
numbered starting with 0 and ending with 8 * SIZE(S) - 1 (this
assumes 8 bits per character).

b) Invert the Nth bit of a string S.


Exercise 2.13   Using DAY, determine whether a given date
is valid. For example, 2/29/1973 is invalid.

Exercise 2.14   Using DAY, write a program which prints a
calendar for the month M and year Y.


Exercise 2.15   Given that the number of days since March 0
is (153*M+2)/5 where M is the number of whole months since
that date, write an expression for the number of whole months
given the number of days. Using this formula rewrite MDY as a
computation.


Exercise 2.16   Assuming that a billion is a thousand
million, add a single statement to SPELL to increase the range
of convertible numbers to a thousand billion.


Exercise 2.17   In the U.S. the terms billion, trillion,
quadrillion, quintillion, sextillion, septillion and octillion
refer to the numbers 1000 million, 1000^2 million, 1000^3
million, ..., 1000^7 million respectively whereas in Great
Britain these terms refer respectively to million^2,
million^3, million^4, ..., million^8. Extend SPELL so that it
will convert its argument up to the octillions in the British
system. Note that SNOBOL4 integers don't go that high so assume
the input is a string and don't use arithmetic operators (like
GE) on anything too big.


Exercise 2.18   Pick a number; count the letters in its
spelled-out form and you produce a new number. For example 13
is spelled 'THIRTEEN' and hence transforms into 8. This
transformation has the interesting property that its repeated
application will cause every number to converge rapidly to 4.
For example, starting with 13, the sequence

     13  8  5  4  4  4  4 ...

is produced. Write a program to determine the smallest integer
between 0 and 10000 which requires the most steps to converge
to 4 (the integer is 113 and it requires 6 steps).


Exercise 2.19   The musical scale is given by the following
sequence of 12 notes.

     C  C#  D  D#  E  F  F#  G  G#  A  A#  B

Given a number N between 1 and 12, write a single
pattern-matching statement to assign the Nth note (a one or two
character string) to the variable NOTE.


SNOBOL4 represents strings by a pointer to string storage.
One of the consequences of this storage management philosophy
is that the cost of string assignment is relatively low. That
is, it costs very little to interchange string values among
variables. In particular it is relatively inexpensive to pass
string values to and from functions.


The functions presented in this chapter all are fairly short 
utility-like functions which operate primarily with strings. 
We will see most of these functions later in the book where 
they will serve as lemma-like procedures to make larger 
programs more understandable. 


Program 3.1: ORDER

ORDER(S) will return an alphabetized version of its argument
S. Thus, ORDER('ORDER') will return 'DEORR'. The alphabetic
ordering of characters is determined, as usual, by &ALPHABET.
To modify the ordering produced by ORDER the statement
containing this keyword should be replaced. ORDER, as we will
see, has many uses. For example, it furnishes an easy way to
check for set equality.


*  ORDER(S) will put the characters of its argument in
*  alphabetic order.
*
         DEFINE('ORDER(S)T,HIGHS,S1')             :(ORDER_END)
*
*  Entry point: Extract a character (T) from S; obtain (in
*  HIGHS) characters alphabetically >= the extracted character.
*  Then scan ORDER for the first occurrence of one of
*  these higher characters.
*
ORDER    S  LEN(1) . T  =                         :F(RETURN)
         &ALPHABET  BREAK(T)  REM . HIGHS
         ORDER  (BREAK(HIGHS) | REM) . S1  =  S1 T   :(ORDER)
ORDER_END


Epilogue 


ORDER is essentially a sorting routine and as such it is an 
insertion sort. Characters are extracted one at a time from 
the argument S and are inserted in order into the growing 
string ORDER. 
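
The insertion step can also be sketched outside SNOBOL4. The Python fragment below is an illustration only and not part of the original program: the names order and ALPHABET are hypothetical, and ALPHABET is a small stand-in for &ALPHABET that must contain every character of the argument.

```python
import string

# Assumption: a small stand-in for &ALPHABET; any string of distinct
# characters defines an ordering. Every input character must occur in it.
ALPHABET = string.digits + string.ascii_uppercase

def order(s, alphabet=ALPHABET):
    """Insertion-sort the characters of s by their position in alphabet."""
    result = ""
    for ch in s:
        # Characters at or after ch in the alphabet (the role of HIGHS).
        highs = alphabet[alphabet.index(ch):]
        # Position of the first already-placed character found in highs,
        # or the end of result if there is none (the REM alternative).
        pos = next((i for i, c in enumerate(result) if c in highs), len(result))
        result = result[:pos] + ch + result[pos:]
    return result

print(order("ORDER"))  # DEORR
```

Two strings contain the same characters with the same multiplicities exactly when their ordered forms are equal, so order("LISTEN") == order("SILENT") holds.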
Programs 3.2 & 3.3: LPAD & RPAD (available in SPITBOL and SITBOL)

LPAD and RPAD are useful in formatting line output. They are
patterned after the built-in functions in SPITBOL and are
included here for use with SNOBOL4. LPAD will pad on the left
to fill out a string to the required field width and RPAD will
pad on the right. Thus


OUTPUT = RPAD(S1,60) LPAD(S2,60) 


will place string S1 on the left and string S2 on the extreme
right of a computer printout page that happens to be 120
characters wide. Both functions may be called with a 3rd
argument to indicate a pad character other than a blank.


*  LPAD(S,N,C) will pad string S on the left with character
*  C until the string is N characters long.  S is returned if
*  it is >= N characters long.  C is taken to be ' ' if
*  unspecified.
*
         DEFINE('LPAD(S,N,C)')                    :(LPAD_END)
LPAD     LPAD  =  GE(SIZE(S),N)  S                :S(RETURN)
         C  =  IDENT(C)  ' '
         LPAD  =  DUPL(C, N - SIZE(S))  S         :(RETURN)
LPAD_END

*  RPAD(S,N,C) pads on the right rather than on the left but
*  its behaviour is otherwise the same as LPAD.
*
         DEFINE('RPAD(S,N,C)')                    :(RPAD_END)
RPAD     RPAD  =  GE(SIZE(S),N)  S                :S(RETURN)
         C  =  IDENT(C)  ' '
         RPAD  =  S  DUPL(C, N - SIZE(S))         :(RETURN)
RPAD_END
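
For readers who want to experiment with the padding rule outside SNOBOL4, here is a minimal Python sketch of the same behaviour. The function names lpad and rpad are hypothetical; the default pad character is a blank, as in the listings above.

```python
def lpad(s, n, c=" "):
    """Pad s on the left with c to width n; return s unchanged if it is
    already n or more characters long."""
    if len(s) >= n:
        return s
    return c * (n - len(s)) + s

def rpad(s, n, c=" "):
    """Pad s on the right with c to width n; otherwise like lpad."""
    if len(s) >= n:
        return s
    return s + c * (n - len(s))

# One string flush left and one flush right on a 120-column line.
line = rpad("LEFT", 60) + lpad("RIGHT", 60)
print(len(line))  # 120
```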
Program 3.4: COUNT

COUNT(S1,S2) will count the number of occurrences of string
S2 in S1. Overlapping occurrences of S2 are counted as
separate occurrences. Thus COUNT('MISSISSIPPI','SI') returns
2, and COUNT('AAA','AA') also returns 2. If a substring is not
found the function effectively returns a zero (actually the
null string).


*  COUNT(S1,S2) counts the number of occurrences of string
*  S2 in string S1.
*
         DEFINE('COUNT(S1,S2)FIRST,REST,P')       :(COUNT_END)
*
*  Entry point: Set up pattern P to scan S1.  P makes rapid
*  scan for first character of S2 and then checks to see if
*  S2 matches.
*
COUNT    S2  LEN(1) . FIRST  REM . REST           :F(RETURN)
         P  =  POS(0)  BREAKX(FIRST)  S2
*
*  Find and remove all characters up to an occurrence of S2.
*  If found put all but first character of S2 back onto S1.
*
COUNT_1  S1  P  =  REST                           :F(RETURN)
         COUNT  =  COUNT + 1                      :(COUNT_1)
COUNT_END

Names referenced    Name      Type       Where defined
by COUNT:           BREAKX    Function   Program 8.2

Epilogue 


The simple-minded approach to this problem is to scan the
string S1 for an occurrence of the string S2, removing all
that precedes the substring and repeating the process until no
more occurrences are found. A faster technique (used here) is
to use the high speed operation of the BREAK function which
scans across a string at machine speeds looking for one of a
class of characters. If successful, then and only then is the
entire word (S2) matched. To employ BREAK in this way it is
convenient to use BREAKX which is defined in Program 8.2
(BREAKX is a built-in function in SPITBOL but not available in
SNOBOL4). BREAKX, unlike BREAK, has implicit alternatives.
If a pattern to its right (its subsequent) fails, it will try
again, picking up one character to the right of where it left
off.
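
The counting rule, with overlapping occurrences counted separately, can be sketched in Python as follows. This is an illustration and not a transcription of the pattern above; the name count is hypothetical. Note that Python's built-in str.count counts only non-overlapping occurrences, so it is not used here.

```python
def count(s1, s2):
    """Count occurrences of s2 in s1, overlapping ones included."""
    if not s2:
        return 0
    n = 0
    start = s1.find(s2)
    while start != -1:
        n += 1
        # Resume one character past the start of the match, so that all
        # but the first character of s2 is scanned again.
        start = s1.find(s2, start + 1)
    return n

print(count("MISSISSIPPI", "SI"))  # 2
print(count("AAA", "AA"))          # 2
```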


Program 3.5: ROTATER

ROTATER(S,N) will rotate the string S right by N characters.
If N is negative the rotation will be to the left. Thus
ROTATER('ABCD',1) will return 'DABC'.


*  ROTATER(S,N) will rotate the string S right by N characters.
*  If N is negative, S will be rotated to the left.
*
         DEFINE('ROTATER(S,N)S1')                 :(ROTATER_END)
*
*  Entry point: If S is null, return.
*
ROTATER  IDENT(S)                                 :S(RETURN)
*
*  Reduce number of positions to be rotated modulo SIZE(S).
*  Note REMDR preserves the sign of N.  If N is negative, use
*  complement.
*
         N  =  REMDR(N, SIZE(S))
         N  =  LT(N,0)  SIZE(S) + N
*
*  Perform the rotation and return.
*
         S  RTAB(N) . S  REM . S1  =  S1 S
         ROTATER  =  S                            :(RETURN)
ROTATER_END
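
A minimal Python sketch of the same rotation follows, assuming the convention stated above (positive N rotates right, negative N rotates left). The name rotater is reused only for illustration. Python's % operator already returns a non-negative remainder for a positive modulus, so no separate complement step is needed.

```python
def rotater(s, n):
    """Rotate s right by n characters; a negative n rotates left."""
    if not s:
        return s
    n %= len(s)              # non-negative even when n is negative
    split = len(s) - n       # the last n characters move to the front
    return s[split:] + s[:split]

print(rotater("ABCD", 1))   # DABC
print(rotater("ABCD", -1))  # BCDA
```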
Program 3.6: REVERSE (available in SPITBOL and SITBOL)

REVERSE(S) will return S with its characters reversed. Thus
REVERSE('SERUTAN') will return 'NATURES'. One use of REVERSE
is to effectively reverse the order of pattern matching. For
example, if one wishes to replace the last occurrence of the
substring SS in the string S with the string R one can write:

         S  =  REVERSE(S)
         S  REVERSE(SS)  =  REVERSE(R)
         S  =  REVERSE(S)


*  REVERSE(S) will reverse the sequence of characters in the
*  string S and return the result.
*
         DEFINE('REVERSE(S)A1,A2,L')
*
*  Initialize REV_ALPHA to hold the reversed alphabet.
*
         TEMP  =  &ALPHABET
REV_1    TEMP  LEN(1) . T  =                      :F(REVERSE_END)
         REV_ALPHA  =  T REV_ALPHA                :(REV_1)
*
*  Entry point: For oversize strings go to REVERSE_1.  Also
*  ignore null strings.
*
REVERSE  L  =  SIZE(S)
         GT(L,256)                                :S(REVERSE_1)
         LE(L,0)                                  :S(RETURN)
*
*  Take the first L characters of &ALPHABET and the last L
*  characters of the reversed alphabet and issue a REPLACE.
*
         &ALPHABET  TAB(*L) . A1
         REV_ALPHA  RTAB(*L)  REM . A2
         REVERSE  =  REPLACE(A2,A1,S)             :(RETURN)
*
*  Divide and Conquer.
*
REVERSE_1  S  LEN(256) . A1  REM . A2
         REVERSE  =  REVERSE(A2) REVERSE(A1)      :(RETURN)
REVERSE_END

Epilogue


The method used to perform the reversal follows a suggestion 
by Morris Siegel. It transforms a string, not by setting up 
the last 2 arguments of REPLACE and effecting a translitera- 
tion, but by setting up the first 2 arguments to accomplish a 
rearrangement. We will elaborate on this before continuing to 
the next function. 


String Transformations

A string transformation is any function which accepts a
string as argument and returns a string as value. As a humble
example, TRIM(S) is a transformation which produces a string
without trailing blanks. Special kinds of transformations exist
which are either interesting in their own right or can be
programmed to run very rapidly.

A homomorphism is a transformation T such that, for any
strings S1 and S2,

     T(S1 S2) = T(S1) T(S2)                              (3.1)

That is, the transformation of the concatenation is equal to
the concatenation of the transformations. Said another way,
the transformation is context free. Since any string S can
ultimately be decomposed into characters, c1, c2, ..., cn, we
have

     T(S) = T(c1) T(c2) ... T(cn)                        (3.2)

And from this last equation we can see that a homomorphism is
completely characterized by the transformation on individual
characters. Let a1, a2, ..., an be a list of all the characters
of the alphabet. Then the set of strings (T(a1), T(a2), ...,
T(an)) identifies completely and unambiguously the
transformation T.


A transliteration is an important special case of a
homomorphism in that each of the strings (T(a1), T(a2), ...,
T(an)) is a character. If T is a transliteration then T can be
programmed in SNOBOL4 as:

     T(S) = REPLACE(S, &ALPHABET, T(&ALPHABET))          (3.3)

In this way any transliteration can be programmed to run very
swiftly merely by obtaining the transliteration of &ALPHABET.
We have seen a number of examples of transliterations.
Programs UPLO (2.1), BCD_EBCDIC (2.2) and HEX (2.6) all make
use of REPLACE to perform the transliteration.


Consider the following statement

     S = REPLACE(S, S1, S2)                              (3.4)

Here S1 and S2 are two equi-length strings which describe a
transliteration on the string S. In fact, only those
characters which appear in S1 undergo a change. If we subject
&ALPHABET to such a transliteration to obtain

     TT = REPLACE(&ALPHABET, S1, S2)                     (3.5)

we can use the result to effect the same transliteration on S
as in (3.4).

     S = REPLACE(S, &ALPHABET, TT)                       (3.6)
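
The equivalence of (3.4) and (3.6) can be checked with a small experiment. The sketch below is in Python rather than SNOBOL4; replace is a hypothetical stand-in for the SNOBOL4 REPLACE built on str.translate, and ALPHABET assumes a 256-character &ALPHABET.

```python
import string

# Assumption: &ALPHABET holds 256 characters in code order.
ALPHABET = "".join(chr(i) for i in range(256))

def replace(s, s1, s2):
    """Stand-in for REPLACE(s, s1, s2): each character of s that occurs in
    s1 becomes the character at the same position of s2."""
    if len(s1) != len(s2):
        raise ValueError("second and third arguments must have equal length")
    return s.translate(str.maketrans(s1, s2))

s1, s2 = string.ascii_uppercase, string.ascii_lowercase
tt = replace(ALPHABET, s1, s2)                  # (3.5): transliterate the alphabet once

s = "Hello, WORLD"
direct = replace(s, s1, s2)                     # (3.4)
via_table = replace(s, ALPHABET, tt)            # (3.6)
print(direct == via_table)  # True
```

The table tt has the same length as ALPHABET and differs from it only at the positions of the characters named in s1.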


A k-transformation is a string transformation that operates 
only on strings of length k and is undefined for strings of 
other length. (Its domain is said to consist of the strings 
of length k.) For example, the permutation (1 3 2) which 
rearranges the 2nd and 3rd characters of a string of length 3 
is a 3-transformation since it only applies to strings of 


length 3. 


A positional transformation is a k-transformation in which the
output is some rearrangement of the characters of the input
string with the properties that 1) characters in some
positions of the input string may be dropped, while others may
appear several times and 2) constant characters may be added
into some fixed positions of the output string. But in any
case the disposition of a character depends on its position
and not its value. More formally, the positional
transformation on strings of length k can be described as:

     t1 c[i1] t2 c[i2] ... tn c[in] t(n+1)

where c[j] is the jth character of the input string, t1, t2,
... are constant strings depending only on the transformation
and i1, i2, ..., in are constant integers chosen from the set
{1, 2, ..., k}.


An example of a positional transformation is depicted 
graphically in Figure 3.1. It transforms a restricted class 
of English words into the corresponding 'piglatin'. Thus DIG 
becomes IGDAY, DOG becomes OGDAY and CAT becomes ATCAY. In 
general, it permutes a 3-character string and appends an 'AY'. 


Another example of a positional transformation, one chosen
from a more practical point of view, is the translation from
ASCII to EBCDIC (see [IBM360a], App. F and [ASCII]). This
transformation is indicated graphically in Figure 3.2. It,
for example, transforms the ASCII code 1010101 to 10110101.


A call to the replace function REPLACE(S1,S2,S3) is said to
[...] exist in S2 the last appearance of each character will
indicate the mapping. In this latter case the operation of the
function would not be ambiguous although the programmer's
motives might be.

Figure 3.1   A positional transformation that translates
three-character words into their pig-latin equivalent.


As we have described earlier, every transformation T defined
as

     T(S) = REPLACE(S,S1,S2)

is a transliteration provided the operation is well-defined.
Also, as has been previously noted, any transliteration T can
be written as REPLACE(S,S1,S2) for some S1, S2. Hence the set
of all transliterations is identical with the set of all
REPLACE's with given 2nd and 3rd arguments.


In a considerably less obvious way, the positional
transformations can also be implemented by the REPLACE
function.

For any strings S1, S2, the transformation defined as

     T(S) = REPLACE(S1, S2, S)

is a positional k-transformation on S where k is the size of
S2.


Conversely, any positional transformation satisfying certain
size constraints can be written as a REPLACE. Let P(S) be a
positional k-transformation. Let S1 be a string composed of k
different characters none of which are included in the
constant characters of the mapping. Then we can express P as

     P(S) = REPLACE(P(S1), S1, S)

Figure 3.2   A positional transformation for converting ASCII
to EBCDIC.


Like the transliterations, we need only obtain the positional 
transformation for one model string to set up a high speed 
program for transforming all strings in the domain. 


As an example, the transformation indicated in Figure 3.1 can
be expressed as

     REPLACE('OGDAY','DOG',S)

As another example the transformation indicated in Figure 3.2
can be expressed as

     REPLACE('12134567','1234567',S)


The characters in the model string must all be different from
any constant characters added to the string. Moreover, the
characters in the model string must all be different from each
other except that characters corresponding to positions that
are dropped may be duplicates of other characters which follow
them. Thus

     REPLACE('XY','XYYYY',S)

will extract the first and last characters from S provided S
is 5 characters long. Therefore, the size constraints imposed
by the REPLACE function are that the total number of
characters in the second argument (i.e. k) plus the number of
different constant characters added in the mapping minus the
positions ignored plus 1 if the last position is ignored
should not exceed the size of &ALPHABET.


A permutation of a string is simply a rearrangement of its
characters and clearly this is a special case of a positional
transformation. String reversal, of a constant length string,
is a permutation and hence can be accomplished by using
REPLACE with suitable 1st and 2nd arguments. But string
reversal of arbitrary length strings represents a class of
permutations and for this reason REVERSE must prepare
appropriate 1st and 2nd arguments depending on the particular
k-transformation it must deal with. But this preparation is
rapidly accomplished by a simple fixed-length pattern matching
operation.
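
The positional use of REPLACE, in which the data string is passed as the third argument, can be tried out with the following Python sketch. It is an illustration under stated assumptions: replace is a hypothetical stand-in for the SNOBOL4 REPLACE, ALPHABET assumes a 256-character &ALPHABET, and the helper names are invented for the example.

```python
# Assumption: &ALPHABET holds 256 characters in code order.
ALPHABET = "".join(chr(i) for i in range(256))

def replace(s, s1, s2):
    """Stand-in for REPLACE(s, s1, s2): a character of s found in s1 becomes
    the character at the same position of s2 (the last duplicate wins)."""
    if len(s1) != len(s2):
        raise ValueError("second and third arguments must have equal length")
    return s.translate(str.maketrans(s1, s2))

def piglatin3(word):
    """Figure 3.1: permute a 3-character word and append the constant 'AY'."""
    return replace("OGDAY", "DOG", word)

def first_and_last(s):
    """Keep positions 1 and 5 of a 5-character string; 2 through 4 are dropped."""
    return replace("XY", "XYYYY", s)

def reverse_k(s):
    """Reversal as a permutation: the model is the first len(s) characters
    of ALPHABET and the template is that model reversed."""
    model = ALPHABET[:len(s)]
    return replace(model[::-1], model, s)

print(piglatin3("CAT"))          # ATCAY
print(first_and_last("HELLO"))   # HO
print(reverse_k("SERUTAN"))      # NATURES
```

In each case the first two arguments form a fixed template and model; only the third argument varies from call to call.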


Program 3.7: BLEND

BLEND(X,Y) will merge the two strings X and Y taking the
first character from X, the 2nd from Y, the 3rd from X, etc.
Thus BLEND('ABC','123') equals 'A1B2C3'. BLEND has been used
previously by the HEX function (Program 2.6) and is an example
of a class of positional transformations which can be
programmed to run quite rapidly. The 2 strings X and Y are
either the same length or X is one character longer than Y.
Thus BLEND('CHAPTER', DUPL(' ',6)) will return
'C H A P T E R'. BLEND's of strings not satisfying these
constraints are undefined.


*  BLEND(S1,S2) will blend the two (equi-length) strings S1
*  and S2 such that every other character is taken from each
*  string.  Thus BLEND('ABC','123') will return 'A1B2C3'.
*
         DEFINE('BLEND(S1,S2)T1,T2,ABC,XYZ,L1,L2')
*
*  Prepare in BLENDED_ALPHABET a blend of the lower and upper
*  halves of &ALPHABET.
*
         &ALPHABET  LEN(128) . ABC  LEN(128) . XYZ
BLE_1    ABC  LEN(1) . T1  =                      :F(BLEND_END)
         XYZ  LEN(1) . T2  =
         BLENDED_ALPHABET  =  BLENDED_ALPHABET T1 T2   :(BLE_1)
*
*  Entry point: Ignore null strings.  If S1 is too large,
*  subdivide and recurse.
*
BLEND    L1  =  SIZE(S1)
         EQ(L1,0)                                 :S(RETURN)
         GT(L1,128)                               :F(BLEND_1)
         S1  LEN(128) . S1  REM . T1
         S2  LEN(128) . S2  REM . T2
         BLEND  =  REPLACE(BLENDED_ALPHABET,&ALPHABET,S1 S2)
+                  BLEND(T1,T2)                   :(RETURN)
*
*  Otherwise prepare AXBYCZ to be a BLEND of ABC and XYZ and
*  to be as long as the string to be returned.  These strings
*  serve as a template for a positional transformation of the
*  combined string S1 S2.
*
BLEND_1  L2  =  SIZE(S2)
         &ALPHABET  LEN(*L1) . ABC  TAB(128)  LEN(*L2) . XYZ
         BLENDED_ALPHABET  LEN(*(L1 + L2)) . AXBYCZ
         BLEND  =  REPLACE(AXBYCZ, ABC XYZ, S1 S2)   :(RETURN)
BLEND_END


Epilogue


The initialization section of BLEND prepares a string
BLENDED_ALPHABET which thereafter is used to obtain templates
for a positional transformation. For very large strings BLEND
is called recursively. As in REVERSE, this is done because of
limitations in the size of &ALPHABET rather than due to any
difficulties or limitations in handling long strings in
SNOBOL4. A slightly faster version of BLEND can be achieved
by nonrecursive methods but it seems hardly worth it.
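
The template idea behind BLEND can be sketched in Python as follows. This is an illustration and not the book's program: replace is a hypothetical stand-in for the SNOBOL4 REPLACE, ALPHABET assumes a 256-character &ALPHABET, and the split at 128 mirrors the listing above. As in the original, the second string must be as long as the first or one character shorter.

```python
# Assumption: &ALPHABET holds 256 characters in code order.
ALPHABET = "".join(chr(i) for i in range(256))
HALF = 128
# Interleave the lower and upper halves of ALPHABET (BLENDED_ALPHABET).
BLENDED = "".join(a + b for a, b in zip(ALPHABET[:HALF], ALPHABET[HALF:]))

def replace(s, s1, s2):
    """Stand-in for REPLACE(s, s1, s2) built on str.translate."""
    if len(s1) != len(s2):
        raise ValueError("second and third arguments must have equal length")
    return s.translate(str.maketrans(s1, s2))

def blend(s1, s2):
    """Interleave s1 and s2, starting with s1."""
    l1, l2 = len(s1), len(s2)
    if l1 == 0:
        return ""
    if l1 > HALF:
        # Too long for one template: blend the first 128 of each, then recurse.
        return blend(s1[:HALF], s2[:HALF]) + blend(s1[HALF:], s2[HALF:])
    model = ALPHABET[:l1] + ALPHABET[HALF:HALF + l2]   # ABC XYZ
    template = BLENDED[:l1 + l2]                       # AXBYCZ
    return replace(template, model, s1 + s2)

print(blend("ABC", "123"))          # A1B2C3
print(blend("CHAPTER", " " * 6))    # C H A P T E R
```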


Program 3.8: BALREV

BALREV(S) will return the balanced reversal of the string S.
That is, the characters of S are reversed and the parentheses
are interchanged. For example, BALREV('F(X)') is '(X)F' rather
than ')X(F' as would be returned by REVERSE. BALREV can be
used to reverse the order of scanning in an environment in
which BAL plays a role in the pattern matching. For example

         S  '('  BAL . E  ')'

will find the first parenthesized expression in S, whereas

         BALREV(S)  '('  BAL . E  ')'
         E  =  BALREV(E)

will set E to be the last parenthesized expression in S.

*  BALREV(S) will return the balanced reversal of S.
*
         DEFINE('BALREV(S)')                      :(BALREV_END)
BALREV   BALREV  =  REPLACE(REVERSE(S), ')(', '()')   :(RETURN)
BALREV_END

Names referenced    Name      Type       Where defined
by BALREV:          REVERSE   Function   Program 3.6

Epilogue


BALREV is not of interest because it offers a challenge to 
one's program-writing abilities but rather because of the 
general notion of balanced reversal that it introduces and the 
fact that we will have occasion to make use of the function in 
later chapters. It is also of interest in that it provides in 
one line of code not only a useful function but one which uses 
both a transliteration and a positional transformation. 
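
The two steps, a reversal followed by a transliteration that interchanges the parentheses, can be sketched in one line of Python as well. The name balrev is reused here only for illustration.

```python
def balrev(s):
    """Reverse s, then interchange '(' and ')' so parentheses stay balanced."""
    return s[::-1].translate(str.maketrans("()", ")("))

print(balrev("F(X)"))  # (X)F
```

Applying balrev twice returns the original string, which is why the result of a match on BALREV(S) can be restored with a second call.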


Program 3.9: SUBSTR (available in SPITBOL and SITBOL)

SUBSTR(S,I,L) will return a substring of the string S
beginning at character I and extending for L characters. If
such a string is not properly included in S then SUBSTR fails.
The SUBSTR function was patterned after the function by the
same name in PL/I. Although the taking of a substring is a
capability implicit in the pattern-matching facilities of
SNOBOL4, its availability as a function offers another
dimension to this most fundamental of string operations.


*  SUBSTR(S,I,L) returns a substring of length L beginning at
*  the Ith character of S.
*
         DEFINE('SUBSTR(S,I,L)')                  :(SUBSTR_END)
SUBSTR   S  LEN(*(I - 1))  LEN(*L) . SUBSTR       :S(RETURN)F(FRETURN)
SUBSTR_END
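
The same 1-based convention can be sketched in Python. Python has no notion of function failure, so in this hypothetical version None stands in for FRETURN; that choice is an assumption of the sketch and not something the original specifies.

```python
def substr(s, i, l):
    """Return the l characters of s starting at 1-based position i, or None
    (standing in for failure) if that span is not wholly inside s."""
    if i < 1 or l < 0 or i - 1 + l > len(s):
        return None
    return s[i - 1:i - 1 + l]

print(substr("ABCDE", 2, 3))  # BCD
print(substr("ABCDE", 4, 3))  # None
```

The explicit bounds check matters because Python slicing silently truncates an out-of-range span instead of failing.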


Program 3.10: DIFF

We may regard a string as a set of characters if we ignore
duplicates and their ordering. The fundamental set operations
are union, intersection and complementation. String
concatenation gives us union. Intersection can be obtained
from union if we also have complementation. Complementation
can be obtained if we have the universe string (set of all
characters) and set difference. &ALPHABET serves as the
universe and DIFF(S1,S2) will return the set difference, S1 -
S2. That is, DIFF(S1,S2) returns a string containing all those
characters that are in S1 and not S2.

         DEFINE('DIFF(S1,S2)')                    :(DIFF_END)
*
*  Entry point: set DIFF to S1 and then remove any consecutive
*  string of S2 characters.
*
DIFF     DIFF  =  S1
         IDENT(S2,NULL)                           :S(RETURN)
         S2  =  SPAN(S2)
DIFF_1   DIFF  S2  =                              :S(DIFF_1)F(RETURN)
DIFF_END
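
The set operations described above can be sketched in Python as follows. The names diff, complement and intersect are hypothetical, and UNIVERSE is a deliberately small stand-in for &ALPHABET so that the results are easy to read.

```python
import string

# Assumption: a small universe standing in for &ALPHABET.
UNIVERSE = string.ascii_uppercase

def diff(s1, s2):
    """Characters of s1 that do not occur in s2, in their original order."""
    return "".join(ch for ch in s1 if ch not in s2)

def complement(s):
    """Every character of the universe that is not in s."""
    return diff(UNIVERSE, s)

def intersect(a, b):
    """Intersection from union (concatenation) and complementation."""
    return complement(complement(a) + complement(b))

print(diff("MISSISSIPPI", "S"))   # MIIIPPI
print(intersect("ABCD", "CDEF"))  # CD
```

Because complement draws its characters from UNIVERSE, the result of intersect comes back in universe order rather than in the order of either argument.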
Program 3.11: SKIM

SKIM(S) 'skims off' the first appearance of each different
character of S and returns the result. Thus
SKIM('MISSISSIPPI') returns 'MISP'.

         DEFINE('SKIM(S)C')                       :(SKIM_END)
*
*  Entry point: Remove character from S and if not already
*  in SKIM, put it there and repeat.
*
SKIM     S  LEN(1) . C  =                         :F(RETURN)
         SKIM  C                                  :S(SKIM_D)
         SKIM  =  SKIM C                          :(SKIM)
*
*  But if C was found in SKIM, it may be prudent to remove
*  all characters already SKIM'ed from S.
*
SKIM_D   S  =  DIFF(S, SKIM)                      :(SKIM)
SKIM_END

Names referenced    Name      Type       Where defined
by SKIM:            DIFF      Function   Program 3.10
Epilogue 


SKIM is slightly more complicated than it has to be. The line
at SKIM_D is not strictly necessary and the statement that
branches to SKIM_D could as well branch to SKIM. But for
efficiency purposes it is better to remove already-skimmed
characters in the wholesale manner of DIFF rather than
painfully, one at a time. The technique used in SKIM is to call
DIFF whenever an old character is found. This will be an
improvement even if it takes relatively long to call DIFF. If
the ratio of times of calling DIFF vs. going through the loop
is 5, then it will pay if as few as 5 characters are removed
by DIFF. It is possible, however, that the calls to DIFF
are too frequent. It may be better to call DIFF only when,
say, 2 characters in a row have already been found.
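
The control flow of SKIM, including the wholesale removal step, can be sketched in Python. This is an illustration only; the name skim is reused for convenience and the filtering expression plays the role of the call to DIFF.

```python
def skim(s):
    """Return the first appearance of each distinct character of s."""
    result = ""
    while s:
        c, s = s[0], s[1:]
        if c in result:
            # An old character turned up: strip every already-skimmed
            # character from the remainder in one pass (the DIFF step).
            s = "".join(ch for ch in s if ch not in result)
        else:
            result += c
    return result

print(skim("MISSISSIPPI"))  # MISP
```

On 'MISSISSIPPI' the first repeated S triggers the removal step, which reduces the remaining 'ISSIPPI' to 'PP' in a single pass.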


Program 3.12: LEXGT

There exists a built-in function in SNOBOL4 called LGT.
LGT(S1,S2) is a predicate which will succeed if string S1 is
lexically greater than S2 and fail otherwise. The
determination of lexical ordering is based on &ALPHABET which
is machine dependent and may not represent the desired
ordering. In particular the lower case alphabet appears
separate from the upper case alphabet so that all upper case
letters are regarded as greater than all lower case letters.
Thus, 'Arabic' is considered greater than 'zebra'. The
function LEXGT which we define below will differ from LGT in
that the lexical ordering will not be based on &ALPHABET but on
a user-supplied transliteration table: LEX_TT.


*  LEXGT(S1,S2) is a predicate to determine whether S1 is
*  lexically greater than S2 according to a user-supplied
*  transliteration table in LEX_TT.
*
         DEFINE('LEXGT(S1,S2)')
*
*  As an example, we will initialize LEX_TT to a value such
*  that upper and lower case letters of the same letter will
*  be regarded as being adjacent.  Also letters will compare
*  lower than anything else.  First form, in ALPHA, the new
*  alphabetic ordering.
*
         ALPHA  =  BLEND(LOWERS_,UPPERS_)
+                  DIFF(&ALPHABET, LOWERS_ UPPERS_)
*
*  Now transform this string to form a transliteration table.
*
         LEX_TT  =  REPLACE(&ALPHABET, ALPHA, &ALPHABET)
+                                                 :(LEXGT_END)
*
*  Entry point: translate and compare.
*
LEXGT    LGT( REPLACE(S1, &ALPHABET, LEX_TT),
+             REPLACE(S2, &ALPHABET, LEX_TT))
+                                                 :S(RETURN)F(FRETURN)
LEXGT_END

Names referenced    Name       Type       Where defined
by LEXGT:           BLEND   *  Function   Program 3.7
                    UPPERS_ *  String     Program 2.1
                    LOWERS_ *  String     Program 2.1
                    DIFF    *  Function   Program 3.10

* indicates name is referenced in the initialization section.


Epilogue


We have effectively modified LGT by modifying its arguments.
In many problems this could be carried one step further for
greater efficiency. Assume that all the data that would ever
appear for comparison purposes is coming from the normal input
stream (under INPUT). We could convert characters as they were
being read in via a statement such as

         L  =  REPLACE(INPUT, &ALPHABET, LEX_TT)

But were we to do this we must be careful in using pattern
matching so that all character strings used to specify
patterns were also mapped in the same way. Thus to match the
line L for 'CAT' we would have to write:

         L  REPLACE('CAT', &ALPHABET, LEX_TT)
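
The construction of the transliteration table can be sketched in Python. The sketch makes several assumptions that are not in the original: ALPHABET is a 256-character stand-in for &ALPHABET in code order, Python's own string comparison (by code point) plays the role of LGT, and the names alpha, LEX_TT, TABLE and lexgt are used only for illustration.

```python
import string

# Assumption: &ALPHABET holds 256 characters in code order.
ALPHABET = "".join(chr(i) for i in range(256))
LOWERS, UPPERS = string.ascii_lowercase, string.ascii_uppercase

# New ordering: aAbBcC...zZ followed by every non-letter.
alpha = "".join(lo + up for lo, up in zip(LOWERS, UPPERS))
alpha += "".join(ch for ch in ALPHABET if ch not in alpha)

# LEX_TT[i] is the rank of ALPHABET[i] in the new ordering, encoded as the
# character with that code (REPLACE(&ALPHABET, ALPHA, &ALPHABET)).
LEX_TT = ALPHABET.translate(str.maketrans(alpha, ALPHABET))
TABLE = str.maketrans(ALPHABET, LEX_TT)

def lexgt(s1, s2):
    """True if s1 is greater than s2 under the ordering given by alpha."""
    return s1.translate(TABLE) > s2.translate(TABLE)

print(lexgt("zebra", "Arabic"))   # True
print(lexgt("Afghan", "artist"))  # True
```

The second result shows the limitation taken up by the next program: 'A' ranks just above 'a', so 'Afghan' still sorts after every word that begins with a lower case 'a'.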


Program 3.13: AGT

One might suspect that LEXGT provides maximum flexibility in
the comparison of strings, since one may supply one's own
alphabet. But it does not handle the important case in which
certain distinct characters are to be regarded as identical for
comparison purposes. In particular, the lower case 'a' and
upper case 'A' are normally regarded as equal for dictionary
purposes. LEXGT would sort words 'able,Afghan,artist' as
'able,artist,Afghan' which is not the dictionary ordering.
AGT(S1,S2) will compare 2 strings and return success if S1 is
alphabetically greater than S2. AGT is blind to the
distinction between upper and lower case. Otherwise it accepts
the ordering implied by &ALPHABET.


*  AGT(S1,S2) is a predicate to determine if S1 is
*  alphabetically greater than S2.  Upper and lower case
*  versions of the same letter are regarded as equal.
*
         DEFINE('AGT(S1,S2)')
         AGT_TT  =  REPLACE(&ALPHABET, UPPERS_, LOWERS_)
+                                                 :(AGT_END)
AGT      LGT( REPLACE(S1, &ALPHABET, AGT_TT),
+             REPLACE(S2, &ALPHABET, AGT_TT))
+                                                 :S(RETURN)F(FRETURN)
AGT_END

Names referenced    Name       Type       Where defined
by AGT:             UPPERS_ *  String     Program 2.1
                    LOWERS_ *  String     Program 2.1

* indicates name is referenced in the initialization section.


Epilogue 


AGT and LEXGT provide 2 distinct means whereby one may alter 
the effective behaviour of LGT. If necessary, these 2 methods 
may be combined into one suitably-designed call to REPLACE. 
We leave this as an exercise. 
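
For comparison, the case-folding table used by AGT can be sketched in Python. The assumptions are the same kind as before: ALPHABET is a 256-character stand-in for &ALPHABET, Python's code-point comparison plays the role of LGT, and the names AGT_TT, FOLD and agt are used only for illustration. The sketch does not attempt the combined table left as an exercise.

```python
import string

# Assumption: &ALPHABET holds 256 characters in code order.
ALPHABET = "".join(chr(i) for i in range(256))
LOWERS, UPPERS = string.ascii_lowercase, string.ascii_uppercase

# Map every upper case letter onto its lower case partner; all other
# characters map to themselves (REPLACE(&ALPHABET, UPPERS_, LOWERS_)).
AGT_TT = ALPHABET.translate(str.maketrans(UPPERS, LOWERS))
FOLD = str.maketrans(ALPHABET, AGT_TT)

def agt(s1, s2):
    """True if s1 is alphabetically greater than s2, ignoring case."""
    return s1.translate(FOLD) > s2.translate(FOLD)

words = ["artist", "able", "Afghan"]
print(sorted(words, key=lambda w: w.translate(FOLD)))  # ['able', 'Afghan', 'artist']
```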
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N SWAP (NAME1,NAME2) will swap the values of 
3.14 0t the named variables. Thus, SWAP(.N,.M) will 
f ! 


SWAP interchange the values of N and M. 

DEFINE ('SWAP (SWAP_ARG 1, SWAP_ARG2) *) : (SWAP_END) 
SWAP SWAP = $SWAP_ARG1 

$SWAP_ARG1 = $SWAP_ARG2 

$SWAP_ARG2 = SWAP 

SWAP = : (RETURN) 
SWAP_END 
Epiloque 


The names of the arguments to SWAP were deliberately chosen to
be strange so as to avoid collision with the outside world. The
variable SWAP is set to null before returning because otherwise
a value would be returned and it is conceivable that in some
cases this would not be desirable.
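
Swapping by name rather than by value can be illustrated with a small Python model. It is a sketch only: the dictionary env and the function swap are hypothetical names, and the dictionary lookup merely stands in for the $ indirection used in the listing.

```python
def swap(env: dict, name1: str, name2: str) -> None:
    """Exchange the values bound to two names in env.

    env models the variable space; indexing it by a name plays the
    part of $SWAP_ARG1 and $SWAP_ARG2. Nothing is returned, which
    mirrors setting SWAP to null before the RETURN.
    """
    env[name1], env[name2] = env[name2], env[name1]

variables = {"N": 1, "M": "two"}
swap(variables, "N", "M")
```

After the call, N holds 'two' and M holds 1.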


Program 3.15  REPL

REPL(S1,S2,S3) will do a string-by-string replacement (as
opposed to a character-by-character replacement a la REPLACE)
on the string S1. The string S1 is scanned for instances of the
string S2 and each is replaced by S3. Portions of S1 already
scanned and the replaced string are not reexamined for
instances of S2.

         DEFINE('REPL(S1,S2,S3)C,T,FINDC')          :(REPL_END)
*  Entry point: Define pattern FINDC which will do a fast
*  scan for the initial character.
REPL     S2  LEN(1) . C  =                          :F(FRETURN)
         FINDC  =  BREAK(C) . T  LEN(1)
         S2  =  POS(0)  S2
*  Top of loop: First remove the prefix, T; then test for S2.
REPL_1   S1  FINDC  =                               :F(REPL_2)
         S1  S2  =                                  :F(REPL_3)
         REPL  =  REPL  T  S3                       :(REPL_1)
REPL_3   REPL  =  REPL  T  C                        :(REPL_1)
*  Return point: The lead character, C, was not found in S1.
REPL_2   REPL  =  REPL  S1                          :(RETURN)
REPL_END

Names referenced    Name      Type        Where defined
by REPL:            BREAKX    Function    Program 8.2
Epilogue


Like the function COUNT, the technique used to speed the
search is to do a fast scan (at BREAK speeds) for the initial 
character. Other than this, the coding is straightforward but 
surprisingly lengthy. 
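
The scanning strategy of REPL (a fast search for the lead character, then a test for the remainder of S2) can be traced in the following Python model. It is a sketch, not the book's routine: the name repl is ours, and a ValueError stands in for the failure return taken when S2 is null.

```python
def repl(s1: str, s2: str, s3: str) -> str:
    """Replace each instance of s2 in s1 by s3, scanning left to right.

    Text already copied to the result, including inserted copies of
    s3, is never rescanned.
    """
    if not s2:
        raise ValueError("s2 must be non-empty")  # the listing fails here
    first, rest = s2[0], s2[1:]
    out = []
    while True:
        i = s1.find(first)          # fast scan for the lead character
        if i < 0:                   # lead character absent: copy the tail
            out.append(s1)
            return "".join(out)
        out.append(s1[:i])          # the prefix T
        s1 = s1[i + 1:]             # drop the prefix and the lead character
        if s1.startswith(rest):     # is the remainder of s2 here?
            out.append(s3)
            s1 = s1[len(rest):]
        else:
            out.append(first)       # false alarm: keep the lead character
```

For example, repl('aaa', 'aa', 'b') yields 'ba': the first two characters are replaced and only the remaining 'a' is examined further.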


Program 3.16  QUOTE

QUOTE(S) will convert its argument to a string which will
resemble a SNOBOL4 expression which, when evaluated, will yield
the original string. In the simplest case QUOTE(S) will place
the string S between apostrophes. However, if S contains
apostrophes, QUOTE will enclose these within double quotes.
Thus

         OUTPUT  =  QUOTE("DON'T")
will print
         'DON' "'" 'T'


Note that EVAL(QUOTE(S)) is always equal to S. QUOTE is useful 
when preparing code. An example is given in RSELECT (Prog. 
16.7). 
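
The text transformation performed by QUOTE can be reproduced in Python as follows. This is a sketch: the function name quote is ours, and Python's str.replace stands in for REPL (the two agree when the search string is a single character).

```python
def quote(s: str) -> str:
    """Build SNOBOL4-style source text for an expression whose value is s."""
    q, qq = "'", '"'
    # Each apostrophe closes the current literal, is emitted on its own
    # inside double quotes, and is followed by a reopened literal.
    return q + s.replace(q, q + " " + qq + q + qq + " " + q) + q

print(quote("DON'T"))  # 'DON' "'" 'T'
```

The blanks around the double-quoted apostrophe matter: in SNOBOL4 they denote concatenation of the three literals.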


         DEFINE('QUOTE(S)S1,Q,QQ')                  :(QUOTE_END)
*  Entry point: The only thing that gives us any trouble is
*  the single quote. If we find one we must wrap it in double
*  quotes and offset it with blanks.
QUOTE    Q  =  "'"  ;  QQ  =  '"'
         QUOTE  =  Q  REPL(S, Q, Q ' ' QQ Q QQ ' ' Q)  Q   :(RETURN)
QUOTE_END

Names referenced    Name    Type        Where defined
by QUOTE:           REPL    Function    Program 3.15


EXERCISES

Exercise 3.1  Write RPAD in terms of LPAD and REVERSE.

Exercise 3.2  Write RPAD in terms of LPAD and ROTATER. Assume
that SIZE(S) < N.

Exercise 3.3  Write a function CENTER(S,N,C) for centering
objects within a field of width N.

Exercise 3.4  Use the REPLACE function and BLEND to rapidly
extract every other character from the string S, starting with
the first. (Assume that SIZE(S) is less than
2 * SIZE(&ALPHABET) and can be even or odd.) This can be done
in 2 statements.

Exercise 3.5  a) Determine S1 and S2 so that REPLACE(S1,S2,S)
realizes the positional transformation shown in Figure 3.3.

b) What is the fewest number of different characters needed
in S1 and S2?


[Figure 3.3: diagram of a positional transformation; the
drawing is illegible in the source.]

Figure 3.3
Exercise 3.6  a) Using REPLACE, obtain the last character of
string S.

b) In a similar way extract the Kth character.
Exercise 3.7  Some cyphers (called Transpositional) serve to
encode text by rearranging characters (see for example Smith
[1955]). The message is written in a rectangular matrix
horizontally from left to right. The encoding is obtained by
reading vertically. Thus, if the matrix is 2x6 and the message
is

         ATTACK
         ATDAWN
the encoding is
         AATTTDAACWKN

a) Write a function TPOS(S,H,W) to encode the string S. H is
the height and W is the width of the matrix and S is assumed to
be exactly H * W characters long.

b) Using TPOS, find S1 & S2 such that REPLACE(S1,S2,S) will
convert all strings of length H * W. (Assume that H * W does
not exceed SIZE(&ALPHABET).)

c) Using the scheme of b) write a function ENCODE which will
encode arbitrary length strings. Trailing characters are
ignored. Thus, if the matrix is 7x3 and the message is

         THEBRIT
         ISHAREC
         OMING

then the encoding is
         'TIOHSMEHIBANRRGIETC'

(Hint: assume some character exists, say colon (:), which will
never appear in the string to be encoded).


Exercise 3.8  a) Extend BLEND(X,Y) so that if string X is n
times longer than string Y then the characters of Y will be
inserted at every (n+1)st position. Thus BLEND('ABCDEF','123')
will return 'AB1CD2EF3'. For efficiency purposes, a table of
templates may be stored for the positional transformations.

b) How would the new BLEND be used in the encoding of TPOS
(see Exercise 3.7)?

Exercise 3.9  Assume that a function OR(S1,S2) is available
for ORing the bits of the equi-length character strings S1 and
S2 (at high speeds). Rewrite CH (Program 2.7) so that it
performs at high speed using the REPLACE function.
Exercise 3.10  E contains a string representing a Fortran
arithmetic expression which consists, possibly, of the sum or
difference of expressions E1 and E2. Keeping in mind that
Fortran associates operators from left to right, parse E
assigning to E1 and E2 the proper values. If E is not of this
form go to label NOT.

Exercise 3.11  Design a 'worst-case' (time-wise) string
argument for SKIM that is 20 characters long.

Exercise 3.12  Any string may be said to denote a set of
characters, viz. the set of which it consists. Assuming that
the strings denoting sets may have duplicate characters, write
an expression to express the a) union and b) intersection of 2
sets S1 and S2. c) Write an expression to indicate the negation
of S. d) Write an expression which succeeds if set S1 equals
set S2.


Exercise 3.13  Write an expression which will succeed if there
are no duplicate characters in the string S (you may use
functions defined in this chapter).

Exercise 3.14  Write an expression to obtain the set of
characters that occur exactly once in a string S.

Exercise 3.15  (a) Remove leading 0's from a string by means of
TRIM, REPLACE, and REVERSE. (b) Remove leading 0's from a
numeric string S (one capable of being converted to integer) by
means of a single operator.


Exercise 3.16  AGT and LEXGT represent 2 methods of effectively
modifying the lexical comparison. To generalize, let the string
ALPHA denote an alphabetic ordering as follows. Sets of equal
letters are enclosed in parentheses. Otherwise the lowest to
the highest characters are ordered left to right. Characters
not in ALPHA may occur in any order. Thus

         ALPHA  =  '(Aa)(Bb)(Cc)(Dd)(Ee)...(Zz)0123456789'
would describe an ordering in which all the alphabetics appear
before the numerics and in which the alphabetics are grouped in
their normal order. (a) Write a program to convert a string
such as ALPHA into a pair of strings A1 and A2 such that

         LGT( REPLACE(S1,A1,A2) , REPLACE(S2,A1,A2) )

will compare strings S1 and S2.

(b) If parentheses themselves are to be included in the
characters to be explicitly ordered a difficulty arises.
Establish escape conventions for parens and modify your
conversion program accordingly.


Exercise 3.17  What 3 variables may not be swapped using SWAP
(Prog. 3.14)?

Exercise 3.18  Assume that input text, contained in the string
S, is a personalized message to someone or some organization.
Within S, and embedded within paired #'s, are SNOBOL4
expressions to be evaluated on an individual basis. The rest of
the text is constant for each message. This text may have
quotes embedded within it but not #'s. Compose, in Q, a SNOBOL4
expression which when evaluated will yield the desired string.
For example if S is:

         DEAR MR. #NAME#:
then a correct translation is

         'DEAR MR. ' NAME ':'

Exercise 3.19  State which of the following are homomorphisms
(h) and which of the homomorphisms are also transliterations
(ht). (a) UPLO, (b) BCD_EBCDIC, (c) ROMAN, (d) HEX, (e) CH,
(f) QUOTE


Exercise 3.20  Some systems accept abbreviations of all command
names. For example, DEL, DE or even D would be acceptable
abbreviations for the DELETE command provided this uniquely
specified the command. Given a list of commands in the string
CMD such as:

         CMD  =  ',ALLOCATE,AUGMENT,BEGIN,CHANGE,...'

write a function C(S) which will determine if a given string S
uniquely specifies a command. If it does, C should return the
command. If it does not, it should fail. Hint: using COUNT
(Prog. 3.4) the body of the routine can be written in one
statement.

Exercise 3.21  Assume that X and Y are string-valued. In one
statement, swap X and Y without using a third variable.
Exercise 3.22  What is the value of

         SIZE(QUOTE(QUOTE('X')))  ?


CHAPTER FOUR


BASIC ARRAYS


CONTENTS


CRACK ...................... 4.1
STRINGOUT .................. 4.2
SEQ ........................ 4.3
AOPA ....................... 4.4
FIND ....................... 4.5
AI ......................... 4.6
TRUNC ...................... 4.7
CATA ....................... 4.8
While strings are convenient for representing input data and
for economizing on search time when scanning for patterns,
arrays are quite useful when it is necessary to randomly alter
selected portions of the interior of the structure. Arrays are
also convenient when dealing with sequences of things other
than characters, such as numbers, patterns, and strings
themselves.


To effectively use the array facility in SNOBOL4 it is
important to have some conception as to how arrays are
implemented. The 3 statements below allocate an array and
assign values to its first 2 elements. Figure 4.1 indicates the
data configuration after the statements are executed.


         ALPHA  =  ARRAY(4)
         ALPHA<1>  =  16
         ALPHA<2>  =  'ABC'

[Figure 4.1: the variable ALPHA holds a descriptor (A,*) that
points to the array. The array begins with a cross-hatched
header; element <1> is (I,16), <2> is (S,*) pointing to 'ABC',
and <3> and <4> are (S,0).]

Figure 4.1

The data configuration after an array allocation
and 2 element assignments.
The array is a data object of type ARRAY (denoted by A in the
datatype field of the descriptor in the variable ALPHA). The
data object has information (denoted by cross hatching) to
indicate its physical extent and upper and lower bounds. In
addition, for every array element, there is one descriptor.
Hence, each array element may be assigned a data object of any
datatype; also, the objects may be of mixed type as the example
illustrates. Thus, an array in SNOBOL4 is more properly
regarded as an array of variables rather than as an array of
data. The default value of array elements is the null string
denoted by (S,0) in the figure.


Since an array is a value, it may readily be passed from
variable to variable. The data configuration resulting from
the following statements is indicated in Figure 4.2.


         BETA  =  ALPHA
         BETA<1>  =  3.7

[Figure 4.2: the descriptors in ALPHA and BETA, both (A,*),
point to the same array, whose element <1> is now (R,3.7);
<2> is (S,*) pointing to 'ABC', and <3> and <4> are (S,0).]

Figure 4.2

The data configuration after an array assignment
(to BETA) and one element assignment.


The assignment to BETA is accomplished only by copying the 
descriptor in ALPHA, not by copying the array. Thus, a 
reference to BETA<1> becomes also a reference to ALPHA<1>, so 
that modification of BETA<1> implies modification of ALPHA<1>. 
This sort of collision can be avoided by use of the COPY func- 
tion. Figure 4.3 illustrates the data configuration which 
results by executing the following 2 statements in place of 
the above 2. 


         BETA  =  COPY(ALPHA)
         BETA<1>  =  3.7


The array elements are variables and hence may be assigned any 
data objects as value, including an array. For example 


         ALPHA<2>  =  BETA
will result in the data configuration shown in Figure 4.4. 
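
The distinction between assigning an array and copying it is not peculiar to SNOBOL4. The following Python sketch models it with lists, which are likewise handled by reference; the variable names follow the text, but indices are 0-based in Python, so alpha[0] corresponds to ALPHA<1>. The use of copy.copy as the analogue of COPY is our assumption.

```python
import copy

alpha = [16, "ABC", "", ""]   # ALPHA = ARRAY(4) with two elements assigned
beta = alpha                  # BETA = ALPHA copies only the reference
beta[0] = 3.7
assert alpha[0] == 3.7        # ALPHA<1> has changed as well

alpha = [16, "ABC", "", ""]
beta = copy.copy(alpha)       # BETA = COPY(ALPHA) makes a distinct array
beta[0] = 3.7
assert alpha[0] == 16         # ALPHA is untouched

alpha[1] = beta               # ALPHA<2> = BETA: an element may hold an array
assert alpha[1][0] == 3.7
```

In the last step no copying takes place either: alpha[1] and beta refer to one and the same list.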


Compared with the rather rich string-handling facilities in 
SNOBOL4 there is a relative lack of such facility with respect 
to arrays. Arrays may be allocated; they may be assigned 
values and these values may later be examined; and the size of 
the array may be obtained via the PROTOTYPE function. But few 
operations are supported that deal with arrays as an entire 
entity. Arithmetic operators may not be applied to arrays. 
Arrays may not be scanned for patterns; they may not be trim- 
med, or concatenated or truncated other than as the programmer 
may provide these facilities himself. 


But the way in which arrays have been implemented in SNOBOL4
does provide the basis for forming a more elaborate array- 
processing facility. Because arrays are represented via a 
pointer, they can readily be passed to and returned from 
subroutines; the time-consuming overhead of copying arrays 
across the boundaries of the call does not exist. Also, and 
perhaps more importantly, the user need not specify the size 
that the returned array is to be, nor need he specify the na- 
ture (i.e. the datatype) of the array elements. Indeed, the 
value returned may be scalar or array with the decision depen- 
ding on what happens at execution time. Array elements may be 
mixed, some being string, some, integer and some, even array. 
With many of the normal restrictions removed, the user is free
to concoct seemingly wild and fanciful operations upon arrays, 
manipulating these data objects with a degree of freedom that 
one normally associates only with strings. Several examples 
of this sort of thing follow. 


The use of descriptor notation can be cumbersome in dealing 
with an array of simple objects such as integers, reals or 
strings. Hence, where the meaning is otherwise clear, we will 
display an array of data objects in the simplified notation 
shown in Figure 4.5b. 


[Figure 4.3: ALPHA and BETA each hold a descriptor (A,*)
pointing to a separate array. In ALPHA's array element <1> is
still (I,16); in BETA's copy element <1> is (R,3.7).]

Figure 4.3

This figure illustrates the effect of the COPY
function as contrasted with assignment.
[Figure 4.4: ALPHA's array has <1> (I,16), <2> (A,*), and <3>
and <4> (S,0). The descriptor in ALPHA<2> and the descriptor in
BETA both point to BETA's array, which has <1> (R,3.7), <2>
(S,*) pointing to 'ABC', and <3> and <4> (S,0).]

Figure 4.4

The result of executing ALPHA<2> = BETA.




[Figure 4.5: (a) an array drawn with descriptors: <1> (S,*)
pointing to 'ABLE', <2> (S,*) pointing to 'BAKER', <3> (R,3.6),
<4> (I,16). (b) the same array drawn as a column of values:
'ABLE', 'BAKER', 3.6, 16.]

Figure 4.5

(a) shows the descriptor representation of an array.
(b) shows a simplified representation for the same array.


Program 4.1  CRACK

CRACK(S,B) is used to 'crack' open the string S and assign its
contents to an array. This array is returned. B is a break
character which serves to separate items in the string. The
caller has the option of ending the string S with a break
character. If none exists, CRACK will append one before further
processing. Thus

         CRACK('ABLE BAKER CHARLIE',' ')

will return the array

             +-----------+
         <1> | 'ABLE'    |
             +-----------+
         <2> | 'BAKER'   |
             +-----------+
         <3> | 'CHARLIE' |
             +-----------+

If B is null, the individual characters are cracked apart.


*  CRACK(S,B) will convert from string to array breaking at
*  the character B.
         DEFINE('CRACK(S,B)I,PAT')                  :(CRACK_END)
*  Entry point: If B is null branch off to CRACK_1.
CRACK    IDENT(B,NULL)                              :S(CRACK_1)
*  If S does not end with a break character append one.
         S  RTAB(1) B ABORT | REM . S  =  S B
*  Then prepare an array (CRACK) of appropriate size and
*  assign to the variable PAT a pattern to extract substrings
*  from S.
         CRACK  =  ARRAY(COUNT(S,B))
         PAT  =  BREAK(B) . *CRACK<I>  LEN(1)
*  Merge here from CRACK_1. Remove the strings and insert
*  them into CRACK. Return when S is exhausted.
CRACK_2  I  =  I + 1
         S  PAT  =                                  :S(CRACK_2)F(RETURN)
*  If no break character, allocate CRACK and assign pattern to
*  PAT. This pattern will strip individual characters from S.
CRACK_1  CRACK  =  ARRAY(SIZE(S))
         PAT  =  LEN(1) . *CRACK<I>                 :(CRACK_2)
CRACK_END

Names referenced    Name     Type        Where defined
by CRACK:           COUNT    Function    Program 3.4
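
The behaviour described for CRACK, including the optional trailing break character and the null-B case, can be checked against the following Python model. It is a sketch only: the name crack is ours, a Python list stands in for the returned array, and str.split replaces the pattern-driven loop of the listing.

```python
def crack(s: str, b: str) -> list:
    """Split s into a list of items at each occurrence of the character b."""
    if b == "":
        return list(s)          # null break character: one item per character
    if not s.endswith(b):
        s += b                  # make sure every item is terminated by b
    return s.split(b)[:-1]      # drop the empty piece after the final b

print(crack("ABLE BAKER CHARLIE", " "))  # ['ABLE', 'BAKER', 'CHARLIE']
```

Note that adjacent break characters yield a null item, just as COUNT(S,B) would reserve an element for it.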
Program 4.2  STRINGOUT

STRINGOUT(A,SEP) will serve to convert from array to string.
SEP contains a separation string to be inserted between strings
of the array A. Thus if A is an array with values

             +---------+
         <1> | 'CAT'   |
             +---------+
         <2> | 'DOG'   |
             +---------+
         <3> | 'MOUSE' |
             +---------+

then STRINGOUT(A,',') will return 'CAT,DOG,MOUSE'. A is assumed
to be singly dimensioned with lower bound 1 and composed of
strings or items which can be concatenated. Note that
STRINGOUT( CRACK(S,B), B ) will return S provided that S does
not end in B. Note also that STRINGOUT( CRACK(S B,B), B ) will
always return S.


*  STRINGOUT(A,SEP) will convert from an array of strings to
*  a string. SEP will serve to separate the strings.
         DEFINE('STRINGOUT(A,SEP)I')                :(STRINGOUT_END)
*  Entry point: Initialize I and STRINGOUT.
STRINGOUT    I  =  1
         STRINGOUT  =  A<1>                         :F(RETURN)
*  Top of loop
STRINGOUT_1  I  =  I + 1
         STRINGOUT  =  STRINGOUT  SEP  A<I>
+                                          :S(STRINGOUT_1)F(RETURN)
STRINGOUT_END
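
A Python counterpart of STRINGOUT is little more than a join. The sketch below (function name ours, a list standing in for the array) converts each item to a string first, on the assumption that this is a fair analogue of SNOBOL4's automatic conversion during concatenation.

```python
def stringout(a: list, sep: str = "") -> str:
    """Concatenate the items of a, inserting sep between neighbours."""
    return sep.join(str(item) for item in a)

print(stringout(["CAT", "DOG", "MOUSE"], ","))  # CAT,DOG,MOUSE
```

As in the listing, the separator appears only between items, never after the last one.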


Program 4.3  SEQ

Although it is not conceptually difficult to sequence through
an array, it can be a tedious exercise if it is required that
we do it over and over. This is especially true in SNOBOL4
which has no DO or FOR statement. SEQ(S,N) provides a
sequencing capability similar to the action of a DO-loop. For
example:

         SEQ(' A<I> = I ', .I)

will initialize an array A such that the Ith element is
assigned the value I. The first argument is a statement or
sequence of statements separated by semicolons. The second
argument is the name of a variable. The variable is assigned
the values 1,2,... and the statement or statements are executed
for each such assignment. This is repeated until failure is
detected on the last statement of the sequence. Thus

         SEQ(" A<K> = TRIM(INPUT) ; DIFFER(A<K>,'STOP')", .K)

will read cards successively into the array A until either A
has no more room or the word 'STOP' is encountered on the input
stream. But note that if an end-of-file is encountered (INPUT
fails) the sequencing will not be stopped. In this case, if no
subsequent file exists, the program will terminate in error.

If failure is detected on the first attempt to execute the
statements then SEQ will return failure. This permits
compounding the iteration as in the following:

         SEQ(" SEQ(' A<I,J> = I * J', .J) ", .I)


The above statement will assign a value (as indicated) to each 
element of a doubly dimensioned array A. 
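
The control flow of SEQ (run the body for 1, 2, ... until it fails, and report failure only if the very first attempt failed) can be modelled in Python as below. This is a sketch under stated assumptions: the body is passed as a function rather than as source text to be compiled, a False result stands in for statement failure, and the names seq, store and a are ours.

```python
def seq(body) -> bool:
    """Call body(i) for i = 1, 2, ... until it returns False.

    Returns False if the first call already failed, True otherwise.
    """
    i = 1
    while body(i):
        i += 1
    return i > 1

a = [None] * 5

def store(i: int) -> bool:
    """Model of ' A<I> = I ': fails when the subscript is out of range."""
    if i > len(a):
        return False
    a[i - 1] = i
    return True

seq(store)   # a is now [1, 2, 3, 4, 5]
```

Because seq itself reports success or failure, a call to it may serve as the body of an enclosing seq, which is the compounding described above.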
*  SEQ(S,N) will sequence through a set of statements until
*  failure is detected. The indexing variable is given by the
*  name N.
         DEFINE('SEQ(ARG_S,ARG_NAME)')              :(SEQ_END)
*  Entry point: Initialize indexing variable. Then convert
*  ARG_S to code.
SEQ      $ARG_NAME  =  0
         ARG_S  =  CODE(ARG_S ' :S(SEQ_1)F(SEQ_2) ')
+                                                   :F(ERROR)
*  Increment indexing variable by 1 and spring off to
*  compiled code. Return will be to SEQ_1 or SEQ_2.
SEQ_1    $ARG_NAME  =  $ARG_NAME + 1                :<ARG_S>
*  Control flows to SEQ_2 if a fail was detected. If first
*  time through fail; otherwise succeed.
SEQ_2    EQ($ARG_NAME,1)                            :S(FRETURN)F(RETURN)
SEQ_END

Program 4.4  AOPA

Some languages such as PL/I and APL permit arrays to be
arguments to arithmetic operators. SNOBOL4 does not permit such
operations, but functions can be written to serve the same
purpose. The resulting function will not be as convenient as
the built-in facility but it will be at least as general, if
not more so, and will be programmer-modifiable. AOPA(A1,OP,A2)
will return a new array whose elements are the result of
applying the indicated operation between corresponding elements
of the arrays A1 and A2. Both A1 and A2 are assumed to be
singly dimensioned of lower bound 1. Either A1 or A2 or both
may be scalar. OP is indicated by a string and can be any
SNOBOL4 operator. Thus

         A  =  AOPA(A, '+', B)
will add the array A to B.

         C  =  AOPA(A, ' ', ',')
will concatenate a comma to every element of the array A.
*  AOPA(A1,OP,A2) will apply the infix operator OP to
*  corresponding pairs of A1 and A2. An array will be returned
*  unless both are scalars.
         DEFINE('AOPA(A1,OP,A2)S1,I,S2,S')          :(AOPA_END)
*  Entry point: First check datatypes. If neither is an array
*  we fall through the two tests, apply the OP to the two
*  scalars and return.
AOPA     IDENT(DATATYPE(A1),'ARRAY')                :S(AOPA_1)
         IDENT(DATATYPE(A2),'ARRAY')                :S(AOPA_2)
         AOPA  =  EVAL('A1 ' OP ' A2')              :(RETURN)
*  A1 is an array; A2 is in doubt.
AOPA_1   S1  =  '<I>'
         S2  =  IDENT(DATATYPE(A2),'ARRAY')  '<I>'
         AOPA  =  ARRAY(PROTOTYPE(A1))              :(AOPA_COMMON)
*  A2 is an array; A1 is not.
AOPA_2   S2  =  '<I>'
         AOPA  =  ARRAY(PROTOTYPE(A2))
*  Common code
AOPA_COMMON
         S  =  ' AOPA<I> = A1' S1 ' ' OP ' A2' S2
         SEQ(S, .I)                                 :(RETURN)
AOPA_END

Names referenced    Name    Type        Where defined
by AOPA:            SEQ     Function    Program 4.3
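
The three cases handled by AOPA (array with array, array with scalar, scalar with scalar) may be easier to see in a Python model. This is a sketch, not a translation: the table OPS is a hypothetical stand-in for compiling the operator with EVAL, it covers only a few operators, and a blank denotes concatenation as in the text.

```python
import operator

# Hypothetical operator table; the listing instead builds and compiles
# the statement text, so any SNOBOL4 operator is accepted there.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    " ": lambda x, y: str(x) + str(y),   # blank stands for concatenation
}

def aopa(a1, op: str, a2):
    """Apply op between corresponding elements; scalars are broadcast."""
    f = OPS[op]
    if isinstance(a1, list) and isinstance(a2, list):
        return [f(x, y) for x, y in zip(a1, a2)]
    if isinstance(a1, list):
        return [f(x, a2) for x in a1]
    if isinstance(a2, list):
        return [f(a1, y) for y in a2]
    return f(a1, a2)                     # both scalar: a scalar result
```

For instance, aopa(['A', 'B'], ' ', ',') returns ['A,', 'B,'], the counterpart of the second example above.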
Program 4.5  FIND

FIND(A,PRED) will search an array for an extreme element. The
type of extreme element will be determined by the predicate
PRED. Thus

         FIND(A,'GE')
will find and return the index of the largest element in the
array A. Specifically it will return the first element in A
which is greater than or equal to all elements of higher
index.

         FIND(A,'GT')
will also return the index of the largest element. If there
is a tie, FIND will return the index of the last such element.
Thus

         EQ( FIND(A,'GT') , FIND(A,'GE') )
may fail, but
         EQ( A< FIND(A,'GT') > , A< FIND(A,'GE') > )

will succeed.

The predicate may be prefixed with the '~' operator. Thus
         A< FIND(A,'~LGT') >

will return the string lowest in alphabetic order of the
strings of the array A.


*  FIND(A,PRED) will return the index of an extreme element
*  in the array A as determined by the predicate PRED.
         DEFINE('FIND(A,PRED)EX,I,MAX,TEST')        :(FIND_END)
*  Entry Point: Construct an expression for comparing 2
*  values. Also initialize FIND and MAX, tentatively.
FIND     EX  =  CONVERT(PRED '(MAX,TEST)', 'EXPRESSION')
         FIND  =  1
         MAX  =  A<FIND>
*  Compare MAX with all elements of higher index than FIND
*  until failure is encountered. If no elements remain,
*  return.
         I  =  1
FIND_1   I  =  I + 1
         TEST  =  A<I>                              :F(RETURN)
         EVAL(EX)                                   :S(FIND_1)
*  A new extreme element has been found.
         MAX  =  TEST
         FIND  =  I                                 :(FIND_1)
FIND_END


Epilogue 


Testing of the array is completed when a reference to A<I>
(first statement after FIND_1) fails (indicating array
reference out of bounds). Note that EX has been assigned an
expression to test MAX against TEST rather than to test MAX
against A<I>. The reader might argue that the latter strategy
is more efficient since it would save one instruction in the
inner loop. That is, failure of EVAL(EX), in this case, would
mean either failure of the predicate PRED or array reference
out of bounds and the distinction could be made afterwards.
But this scheme would not work because ~LGT(MAX,A<I>) actually
succeeds if the array reference A<I> is out of bounds. That
is to say the unary ~ operator does not merely negate the
predicate, it negates the entire expression. In any case, the
savings would not be very great. As we will see, assignments
and statement overhead cost little compared with anything else
in the language.
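
The tie-breaking behaviour of FIND under 'GE' and 'GT', and the effect of a negated predicate, can be verified with the following Python model. It is a sketch only: the table PREDS is a hypothetical substitute for converting PRED to an expression, Python's string comparison stands in for LGT, and indices are reported 1-based to match the text.

```python
import operator

# Hypothetical predicate table; LGT is modelled by Python's > on strings.
PREDS = {"GT": operator.gt, "GE": operator.ge,
         "LT": operator.lt, "LE": operator.le, "LGT": operator.gt}

def find(a: list, pred: str) -> int:
    """Return the 1-based index of an extreme element of a."""
    negate = pred.startswith("~")
    test = PREDS[pred.lstrip("~")]
    best = 1                                  # FIND = 1, MAX = A<1>
    for i in range(2, len(a) + 1):
        keep = test(a[best - 1], a[i - 1])    # PRED(MAX,TEST)
        if negate:
            keep = not keep
        if not keep:                          # predicate failed: new extreme
            best = i
    return best

a = [3, 7, 5, 7]
```

With this array, find(a, 'GE') is 2 and find(a, 'GT') is 4, yet both select the value 7.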
Program 4.6  AI

AI(A,I) (Apply Index), where A and I are arrays, will regard I
as a set of indices to be applied to the array A. The result is
an array. Thus if

                 +----------+                    +---+
             <1> | 'CAT'    |                <1> | 3 |
                 +----------+                    +---+
     A  =    <2> | 'DOG'    |        I  =    <2> | 2 |
                 +----------+                    +---+
             <3> | 'CANARY' |
                 +----------+

the array returned is

                 +----------+
             <1> | 'CANARY' |
                 +----------+
             <2> | 'DOG'    |
                 +----------+

If I is a scalar the result will be A<I>.


*  AI(A,I) will apply the indices contained in I to the array
*  A.
         DEFINE('AI(A,I)J')                         :(AI_END)
*  Entry point: If I is not an array, go to AI_1 where we
*  merely return the Ith element.
AI       IDENT(DATATYPE(I),'ARRAY')                 :F(AI_1)
*  Make AI, the array to be returned, look like I. Then apply
*  the indices.
         AI  =  ARRAY(PROTOTYPE(I))
         SEQ(' AI<J> = A<I<J>> ', .J)               :(RETURN)
AI_1     AI  =  A<I>                                :(RETURN)
AI_END

Names referenced    Name    Type        Where defined
by AI:              SEQ     Function    Program 4.3
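
Applying an array of indices to an array amounts to a gather operation. A Python model (function name ours, lists standing in for arrays, indices kept 1-based as in the text) might read:

```python
def ai(a: list, i):
    """Select elements of a by the 1-based index or list of indices i."""
    if isinstance(i, list):
        return [a[j - 1] for j in i]   # one result element per index in i
    return a[i - 1]                    # scalar index: a single element

print(ai(["CAT", "DOG", "CANARY"], [3, 2]))  # ['CANARY', 'DOG']
```

The result takes its shape from I, not from A, which is why the listing allocates AI with the prototype of I.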


Program 4.7  TRUNC

TRUNC(A,L,H) will return the truncation of the
singly-dimensioned array A. That is, a new array will be
created and returned consisting of the elements A<L>, A<L+1>,
..., A<H>.

         DEFINE('TRUNC(A,L,H)')                     :(TRUNC_END)
TRUNC    TRUNC  =  ARRAY(H - L + 1)
         L  =  L - 1
         SEQ(' TRUNC<I> = A<L + I> ', .I)           :(RETURN)
TRUNC_END

Names referenced    Name    Type        Where defined
by TRUNC:           SEQ     Function    Program 4.3
Program 4.8  CATA

CATA(A1,A2) will concatenate the two arrays A1 and A2. Both are
assumed singly-dimensioned of lower bound one. The returned
array also has lower bound one.

         DEFINE('CATA(A1,A2)I,N1')                  :(CATA_END)
CATA     N1  =  PROTOTYPE(A1)
         CATA  =  ARRAY(N1 + PROTOTYPE(A2))
         SEQ(' CATA<I> = A1<I> ', .I)
         SEQ(' CATA<N1 + I> = A2<I> ', .I)          :(RETURN)
CATA_END

Names referenced    Name    Type        Where defined
by CATA:            SEQ     Function    Program 4.3
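
For comparison, TRUNC and CATA correspond to slicing and list concatenation in Python. The sketch below uses our own function names and translates the 1-based, inclusive bounds L and H of the text into Python's 0-based, half-open slice.

```python
def trunc(a: list, l: int, h: int) -> list:
    """Return a new list holding elements l..h of a (1-based, inclusive)."""
    return a[l - 1:h]

def cata(a1: list, a2: list) -> list:
    """Return a new list consisting of a1 followed by a2."""
    return a1 + a2

print(trunc([10, 20, 30, 40, 50], 2, 4))  # [20, 30, 40]
print(cata([1, 2], [3]))                  # [1, 2, 3]
```

Both functions build a new list and leave their arguments unchanged, as do the SNOBOL4 versions.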


Exercise 4.1  A common problem is to initialize an array with
a large number of strings. Commonly this is done with
assignment statements but if the list is long this technique
can prove wearisome. Using CRACK, assign an array of length 12
to the variable M assigning to M<I> the name of the Ith month
(or an acceptable abbreviation). Thus M<1> = 'JAN.', etc.

Exercise 4.2  Modify SEQ so that it accepts 2 additional
(optional) arguments. The first will be a lower bound (if not
present the lower bound is taken to be 1) and the second will
indicate the increment (either positive or negative). The
default increment should, of course, be 1.

Exercise 4.3  Let A be an array with lower bound 1.

a) What will be the result of the following 2 statements?

         N  =  +PROTOTYPE(A)
         SEQ(' SWAP(.A<I>, .A<N + 1 - I>)', .I)

b) Modify the second statement above so that the array A is
actually reversed.

Exercise 4.4  Rewrite STRINGOUT using SEQ.


Exercise 4.5  Assume A is an array of strings having a lower
bound of 1. Use SEQ to find the index of the first element in A
which begins with the character 'M'.

Exercise 4.6  Modify AOPA so that if the value of OP
syntactically resembles an identifier, it is regarded as a
binary function.

Exercise 4.7  Is AOPA(A1,,A2) a valid call? If so, what does it
do?

Exercise 4.8  Write a function OPA(OP,A) which will apply the
unary operator OP to every element of the array A.

Exercise 4.9  Write BLEND(X,Y) where X and Y are equi-length
strings by an expression involving functions defined in this
chapter.

Exercise 4.10  Extend AI to permit I to range over a)
2-dimensional arrays, b) multidimensional arrays, and c)
programmer-defined data objects.

Exercise 4.11  The statement

         &ALPHABET  BREAK(S)  LEN(1) . T
will assign to T the character in S lowest in the alphabet. Do
the same using FIND and other functions defined in this
chapter.

Exercise 4.12  In TRUNC, the statement L = L - 1 could be
removed if the subsequent statement were modified. What
modification is needed? Why was it not done this way?
| Exercise 4.13 | Write a function DO(S,N,L,U,I) where S is a
statement sequence, N is a name, L is a
lower bound, U is an upper bound, and I is an increment. DO
should simulate a Fortran DO-loop.


| Exercise 4.14 | (a) Define a function LBOUNDS(A) which will
return an array equal to the sequence of
lower bounds of the array A. Define a function UBOUNDS(A) to
do a similar thing with upper bounds. For example,
LBOUNDS(ARRAY('3:10,-1:1')) will return an array containing
two integers, 3 and -1.


(b) Write a function INCREMENT(S,L,U,N) which will increment
and return a sequence of subscripts contained in the array S. 
L is an array of lower bounds as might be obtained from the 
LBOUNDS function of the previous exercise and U is an array of 
upper bounds. N is the size of each of these arrays. The 
function should fail if no more increments remain. 


(c) Using the functions INCREMENT, LBOUNDS, UBOUNDS defined 
above, write a program to print out every item in an array A. 
A may have any prototype but all of its items may be assumed 
to be printable. 


| Exercise 4.15 | Write a function called PUSH(A,E) which
will push an element E onto an array A
which is acting like a stack. The first element of A contains
the index of the last element pushed. If A runs out of room,
double its size. PUSH will return A or the newly created ar-
ray. Routines in this section may be used if applicable.

The SNOBOL series of programming languages through
SNOBOL3 had only one datatype, the string. Even the
arithmetic facilities of SNOBOL3 were implemented as
operations on strings of digits rather than on machine
integers. Because of this historical bias, and because
the language is extraordinarily rich in string handling,
SNOBOL4 is still regarded by some as exclusively a string
language. Yet, all the basic facilities which one expects in
a list processing language have been incorporated into
SNOBOL4; these include the automatic allocation and freeing of
storage, recursive functions, the pointer, and the data
structure. Moreover, the notation is, for the most part,
conventional, convenient and flexible. Were SNOBOL4 suddenly
stripped of all its pattern matching capabilities, it would still
be a powerful and convenient list-processing language.


What do we mean by list processing? This is the kind of data 
processing in which associated data is linked together via 
pointers as opposed to an array organization in which as- 
sociated data is placed in consecutive locations. List 
processing is used whenever the association of data is likely 
to change because such change can be readily accomplished 
merely by modifying links rather than by moving data.


A list is technically a sequence of items joined together by 
pointers and is really just a special case of an arbitrary 
linked structure. Hence 'list processing' is a misnomer for
what might be better termed 'link processing'. However, a list
may contain items of any kind, including other lists so that
arbitrary trees may be formed. Hence, a list is more general
than what is at first blush indicated. Nonetheless, it is
important to realize that by list processing we mean, really, an
arbitrarily interlaced collection of data objects with the 
possibility of loops and with no restrictions on the number of 
nodes or the number of links per node. In other words we are 
really speaking of arbitrary graphs. 


The method by which one does list-processing in SNOBOL4 is via
the so-called programmer-defined datatype. Calling the func- 
tion DATA, one can define a new datatype. Instances of this 
datatype can be created by making what appear to be function 
calls to the name of the datatype. Thus 


          DATA('LINK(NEXT,VALUE)')
          L = LINK('XYZ',22)


will first define a datatype called LINK and then assign to L 
an object whose 2 fields (viz. NEXT and VALUE) are initialized 
with the 2 values given as arguments. The result is shown in 
Figure 5.1. 


[Figure 5.1: the variable L points to a LINK structure whose
NEXT field holds the string 'XYZ' and whose VALUE field holds
the integer 22.]


For convenience we will refer to data objects of this kind as
structures and to an interlaced set of structures as a data
configuration. Like arrays, structures consist of a sequence
of variables (one created variable for each field) together
with some miscellaneous information denoted by cross hatching
in the figure. These fields may be referenced via function 
notation such as 


          NEXT(L) = 'ABC'
          N = VALUE(L) + 3


Such field references may be used wherever a variable may be 
used, such as on the left hand side of an assignment (as 
above) or on the right hand side of a variable association 
operator (binary . or $). As in the case of all variables, 
the field of a structure may be assigned a data object of any 
type, including another structure. Thus 


          NEXT(L) = LINK()

will allocate a new LINK structure and assign it to the NEXT
field of L. This statement will result in the configuration
shown in Figure 5.2.

A field of a structure may refer to the structure in which it
is embedded or to any part of the configuration. Thus,
continuing

          NEXT(NEXT(L)) = L


will produce the configuration shown in Figure 5.3. 
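
For readers more at home in another language, the statements above can be mirrored by a rough Python analogue. This is only a sketch: the class name Link and the use of None for the null value are our own assumptions, and Python attribute access stands in for SNOBOL4 field functions.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(eq=False)          # eq=False: compare by identity, safe for loops
class Link:
    """Rough analogue of DATA('LINK(NEXT,VALUE)')."""
    next: Any = None
    value: Any = None


l = Link('XYZ', 22)           # like L = LINK('XYZ',22)
l.next = 'ABC'                # like NEXT(L) = 'ABC'
n = l.value + 3               # like N = VALUE(L) + 3
l.next = Link()               # like NEXT(L) = LINK()
l.next.next = l               # like NEXT(NEXT(L)) = L, closing the loop
```

As in the SNOBOL4 version, a field may hold a value of any type, including the structure that contains it.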
[Figure 5.2: L points to a LINK whose VALUE field holds the
integer 22 and whose NEXT field points to a second LINK with
null NEXT and VALUE fields.]


There is no intrinsic limit to the number of fields of a 
structure or to the number of new datatypes that may be 
created. 


It is sometimes required that we obtain a pointer to one of 
the fields of a structure. This we may do by use of the unary 
name operator. Thus 


[Figure 5.3: the configuration of Figure 5.2 with the NEXT
field of the second LINK pointing back to the first LINK.]


L = LINK() 
ALPHA = .NEXT(L) 


will result in the configuration shown in Figure 5.4. 


[Figure 5.4: L points to a LINK with null NEXT and VALUE
fields; ALPHA holds a value of datatype N (NAME) pointing to
the NEXT field of that LINK.]

The datatype indicated for ALPHA is 'N' for NAME. We may as-
sign any value to the variable whose name ALPHA contains, by
using the unary $ operator. For example:

          $ALPHA = LINK()


will result in the configuration shown in Figure 5.5. 
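
Languages without a name operator can imitate it with a small object that remembers which structure and which field it refers to. The following Python sketch is an analogy only; FieldRef, get and set are hypothetical names of our own and not part of SNOBOL4.

```python
class Link:
    """Stand-in for a LINK structure with NEXT and VALUE fields."""
    def __init__(self, next=None, value=None):
        self.next = next
        self.value = value


class FieldRef:
    """Plays the role of a NAME: it designates one field of one structure."""
    def __init__(self, obj, field):
        self.obj = obj
        self.field = field

    def get(self):
        return getattr(self.obj, self.field)

    def set(self, value):
        setattr(self.obj, self.field, value)


l = Link()
alpha = FieldRef(l, "next")   # like ALPHA = .NEXT(L)
alpha.set(Link())             # like $ALPHA = LINK()
```

After the last statement the NEXT field of the first structure holds the new structure, just as in Figure 5.5.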


[Figure 5.5: as Figure 5.4, except that the NEXT field named
by ALPHA now points to a second LINK with null NEXT and VALUE
fields.]
Two different datatypes may have the same field without fear
of collision. Thus


          DATA('TN(VALUE,NEXT,LSON,RSON)')


will define a new kind of data called TN (for Tree Node). 
Executing 


T = TN(16, LINK()) 
NEXT(NEXT(T)) = .T 


will result in the structure shown in Figure 5.6. 


[Figure 5.6: T points to a TN structure with VALUE 16, null
LSON and RSON fields, and a NEXT field pointing to a LINK; the
NEXT field of that LINK holds a NAME pointing back to the
variable T, and its VALUE field is null.]
Program 5.1 - READL

The function READL(P) will read in a sequence of items,
placing them in a list, and return the head of the list.
P is a pattern to indicate the end of the list. If P is
null (or equivalently, absent) the list is read in until an
end-of-file condition is encountered. Otherwise, it will stop 
reading when the pattern match succeeds. It will not include 
the card matched. Thus READL(POS(0) 'STOP') will read a se- 
quence of strings up to but not including the first string 
having the word 'STOP' in column 1. 


          DEFINE('READL(P)N,S')
          DATA('LINK(NEXT,VALUE)')                    :(READL_END)
*
*  Entry point: If P is null, make sure the pattern will
*  fail.
*
READL     P = IDENT(P) ABORT
*
*  N will be the name of the variable to receive the next
*  LINK of the list.  Initialize it to point to READL.
*
          N = .READL
*
*  Top of loop: Read a card; try the pattern; append the
*  LINK; and update N.
*
READL_1   S = INPUT                                   :F(RETURN)
          S P                                         :S(RETURN)
          $N = LINK(,S)
          N = .NEXT($N)                               :(READL_1)
READL_END

Program 5.2 - READRL

READRL(P) will read a list in reverse. That is, the head of
the returned list will contain the last string read. The
reversed read is curiously easier to write (and keypunch)
than READL and appears to be a more natural way of appending
items onto a list.


          DEFINE('READRL(P)')
          DATA('LINK(NEXT,VALUE)')                    :(READRL_END)
*
*  Entry point: Set P; go through the loop inserting the
*  latest LINK onto the front of the list.
*
READRL    P = IDENT(P) ABORT
READRL_1  S = INPUT                                   :F(RETURN)
          S P                                         :S(RETURN)
          READRL = LINK(READRL,S)                     :(READRL_1)


READRL_END 
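
The difference between the two readers is easy to see in a rough Python analogue. This is a sketch under our own assumptions: the items come from any iterable rather than from INPUT, a predicate named stop stands in for the pattern P, and a dummy header structure takes the place of the name variable N.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(eq=False)
class Link:
    next: Any = None
    value: Any = None


def readl(items, stop=None):
    """Forward read: keep track of the place where the next Link must go."""
    anchor = Link()                 # its next field plays the role of READL
    tail = anchor                   # structure whose next field is filled next
    for s in items:
        if stop is not None and stop(s):
            break                   # the stopping item is not included
        tail.next = Link(None, s)
        tail = tail.next
    return anchor.next


def readrl(items, stop=None):
    """Reverse read: each new Link simply goes on the front."""
    head = None
    for s in items:
        if stop is not None and stop(s):
            break
        head = Link(head, s)
    return head


def values(head):
    """Helper for inspection: collect the VALUE fields in list order."""
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out


cards = ["ALPHA", "BETA", "STOP HERE", "GAMMA"]
at_stop = lambda s: s.startswith("STOP")
forward = values(readl(cards, at_stop))
backward = values(readrl(cards, at_stop))
```

The reverse reader needs no bookkeeping about the end of the list, which is why it is the shorter of the two.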



Program 5.3 - REVL

REVL(L) will reverse a list L. The algorithm works according
to the diagram in Figure 5.7. For simplicity the list elements
have been denoted by a single cell. Also, an arrow
impinging onto the outline of a cell represents a pointer
to the data object and not a pointer to any particular field 
within the data object. REVL and L work their way down the 
list with L leading the way and REVL right behind. At each 
step the NEXT field of L is made to point backward to the 
value of REVL and then the 2 variables are incremented, so 
that they always span the 'gap' in the chain of links. 


          DEFINE('REVL(L)T')
          DATA('LINK(NEXT,VALUE)')                    :(REVL_END)
*
*  Entry point: Return L if it is not a link.  Otherwise,
*  initialize REVL and L to span the gap between the first
*  link and the rest of the list.
*
REVL      REVL = L
          IDENT(DATATYPE(L),'LINK')                   :F(RETURN)
          L = NEXT(REVL)
          NEXT(REVL) =
*
*  Go through loop making NEXT(L) point backward to REVL and
*  walk one step forward (T is a temporary to hold NEXT(L)).
*  Quit when L becomes NULL.
*
REVL_1    IDENT(L)                                    :S(RETURN)
          T = NEXT(L)
          NEXT(L) = REVL
          REVL = L
          L = T                                       :(REVL_1)

REVL_END 
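
The same gap-spanning walk can be sketched in Python. The names below are our own; rev plays the part of REVL and l the part of L, and None stands in for the null value.

```python
class Link:
    def __init__(self, next=None, value=None):
        self.next = next
        self.value = value


def revl(l):
    """Reverse a chain of Links in place and return the new head."""
    if not isinstance(l, Link):
        return l                  # not a link: nothing to reverse
    rev = l
    l = rev.next
    rev.next = None               # the old head becomes the new tail
    while l is not None:
        t = l.next                # temporary, holds the rest of the list
        l.next = rev              # point backward across the gap
        rev = l                   # step both variables forward
        l = t
    return rev


def values(head):
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out


head = Link(Link(Link(None, 3), 2), 1)    # the list 1, 2, 3
reversed_values = values(revl(head))
```

No new structures are allocated; only the NEXT fields change.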

Program 5.4 - LAST

LAST(L) will return (by name) the name of the last NEXT field
of a list. Thus, if L1 and L2 are lists

          LAST(L1) = L2

will concatenate the two lists. If the argument to LAST is
null the function fails. Thus

          LAST(L1) = L2                               :S(LAB1)
          L1 = L2
LAB1

will concatenate L2 to L1 even if one or both of the lists are
null. Also

          LAST(L) = L

creates a circular list.




          DEFINE('LAST(L)')                           :(LAST_END)
*
*  Entry point: if L is null, fail.
*
LAST      IDENT(L)                                    :S(FRETURN)
*
*  Seek a null NEXT field.
*
LAST_1    L = DIFFER(NEXT(L)) NEXT(L)                 :S(LAST_1)
*
*  Return the name of this field by name.
*
          LAST = .NEXT(L)                             :(NRETURN)

LAST_END 
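
Python cannot return a field by name, so a rough analogue returns the last structure itself and lets the caller assign to its next attribute. This is a sketch under that assumption; the helper concat and the use of None to signal failure are our own choices.

```python
class Link:
    def __init__(self, next=None, value=None):
        self.next = next
        self.value = value


def last(l):
    """Return the Link whose next field is null, or None if l is null."""
    if l is None:
        return None               # stands in for failure
    while l.next is not None:
        l = l.next
    return l


def concat(l1, l2):
    """Append l2 to l1, coping with either list being null."""
    tail = last(l1)
    if tail is None:
        return l2                 # like the fall-through L1 = L2
    tail.next = l2                # like LAST(L1) = L2
    return l1


a = Link(Link(None, 2), 1)        # the list 1, 2
b = Link(None, 3)                 # the list 3
joined = concat(a, b)

ring = Link(Link(None, "Y"), "X")
last(ring).next = ring            # like LAST(L) = L: a circular list
```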


Programs 5.5, 5.6 & 5.7 - PUSH, POP & TOP

These routines are stack manipulation routines. As their
names suggest, PUSH and POP are used respectively to put an
item on and take an item off a stack. TOP is used to examine
the last element of a stack without modifying it. Thus


          PUSH('ABC') ; PUSH(3)

will push 2 items onto a stack.

          K1 = POP() ; K2 = TOP()
          K3 = POP() ; K4 = TOP()

will assign to K1 the value 3, to K2 the value 'ABC', to K3
the value 'ABC' and will not modify K4 as the calls to TOP and
POP fail when the stack is empty. As an added bonus, TOP and 
POP will return by name. In the case of TOP, this means that 
values can be assigned into the top element. For example, 


TOP() = 'XYZ' 


will change the value at the top of the stack. PUSH returns 
the item pushed; more exactly it returns the field bearing the 
item last pushed. Hence, 


PUSH() = S 


has the same effect as PUSH(S). Having been written in this 
way, PUSH can be used to push matched substrings of a pattern 
match onto a stack. For example, 


          S P1 . *PUSH() P2 . *PUSH()


is a pattern matching statement which, if the match succeeds,
will cause two substrings to be pushed onto the stack. We will
require this property of PUSH in the chapter on compiling. See
L_ONE, Prog. 18.2.


          DEFINE('PUSH(X)')
          DEFINE('POP()')
          DEFINE('TOP()')
          DATA('LINK(NEXT,VALUE)')                    :(PUSH_END)
*
*  Entry point for PUSH: Just allocate a LINK and put it at
*  the head of the stack pointed to by the global variable
*  PUSH_POP.  Then return the VALUE field by name.
*
PUSH      PUSH_POP = LINK(PUSH_POP,X)
          PUSH = .VALUE(PUSH_POP)                     :(NRETURN)
*
*  Entry point for POP: If the global stack is null, fail.
*  Otherwise return the element and pop the stack.
*
POP       IDENT(PUSH_POP)                             :S(FRETURN)
          POP = VALUE(PUSH_POP)
          PUSH_POP = NEXT(PUSH_POP)                   :(RETURN)
*
*  Entry point for TOP: Return name of VALUE field by name.
*  Fail if none exists.
*
TOP       IDENT(PUSH_POP)                             :S(FRETURN)
          TOP = .VALUE(PUSH_POP)                      :(NRETURN)
PUSH_END 
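
A linked stack of this kind might be sketched in Python as follows. The class Stack and the use of IndexError to signal failure are assumptions of ours; returning the top structure from push and top is the nearest Python equivalent of returning the VALUE field by name.

```python
class Link:
    def __init__(self, next=None, value=None):
        self.next = next
        self.value = value


class Stack:
    def __init__(self):
        self.head = None                  # plays the role of PUSH_POP

    def push(self, x=None):
        self.head = Link(self.head, x)
        return self.head                  # caller may assign to .value

    def top(self):
        if self.head is None:
            raise IndexError("stack is empty")
        return self.head

    def pop(self):
        if self.head is None:
            raise IndexError("stack is empty")
        node = self.head
        self.head = node.next
        return node.value


st = Stack()
st.push('ABC')
st.push(3)
k1 = st.pop()
k2 = st.top().value
k3 = st.pop()
k4 = 'unchanged'
try:
    k4 = st.top().value                   # fails: the stack is now empty
except IndexError:
    pass

st.push().value = 'S'                     # like PUSH() = S
st.top().value = 'XYZ'                    # like TOP() = 'XYZ'
```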

Program 5.8 - COPYL

COPYL will copy a list. It makes use of the built-in function
COPY which can be used to copy structures (as well as arrays).
Hence if a list is a chain of LINKs then COPY will
be used to copy each LINK in turn. If it should happen that
the VALUE field of a list points off to some other list, then 
a recursive function call is used to copy this subsidiary 
list. No difficulty follows from this simple procedure unless 
the data configuration has loops. If one of the fields points 
back to a node which has already been copied, we need not, and 
in fact must not, make a new copy of this node. Hence we must 
find a method to indicate which nodes have already been 
visited. This problem is not unique to COPYL. It arises 
whenever we wish to process every node of a data configuration 
with loops. We solve the problem here with tables. Another
method, one involving marking the structure itself, is
described in VISIT, Prog. 5.10. 


To avoid marking structures, we keep a list of all items al- 
ready copied paired with copied counterparts. This is most 
easily done with a SNOBOL4 table. A table is similar to an 
array except that the subscripts are not restricted to in- 
tegers but may be any value. Thus 


TBL = TABLE(100) 
TBL<X> = Y 

will assign the Xth element of TBL the value Y, no matter what
the datatypes of X and Y are. The value of 100 is an estimate
of the number of items to be placed into the table. Thus, a
table is a kind of associative array. It is implemented as a
collection of descriptor pairs. When items are entered or ex-
tracted, a search must be made for the subscript. In SPITBOL
the value is hashed so that the search is fairly rapid. In
SNOBOL4 the search is linear but is not all that slow because
only descriptors need be compared. In both languages the
search is quite rapid for small tables.


In our particular application we are interested in the case 
where X and Y are structures. If L is a LINK then 


TBL<L> = COPY(L) 


will associate with that particular LINK a copy of that LINK. 
In this way, we not only mark that a LINK has been copied but 
we point directly to the copied LINK. 


All this suggests allocating a table when COPYL is first cal- 
led. But, if COPYL is called recursively, we do not want to 
allocate a new table but rather retain the old one. This can 
be done in several ways. Two functions may be defined, COPYL
and COPYL_INT. COPYL will receive control from external
sources; COPYL_INT will be called internally and will not allocate
the table.


Another approach, one to be used here, does not require that 
another function be defined. Rather, the COPYL function is 
redefined, by itself, twice, once immediately after receiving 
control, and once immediately before returning. 


*
*  COPYL(L) will copy a list of LINKs.  The configuration may
*  have loops.
*
          DEFINE('COPYL(L)T')
          DATA('LINK(NEXT,VALUE)')                    :(COPYL_END)
*
*  Entry point: Redefine COPYL to have a new entry point and
*  in which T will be treated as global.
*
COPYL     DEFINE('COPYL(L)','COPYL_1')
*
*  Allocate a table and call COPYL.  100 is the estimate of
*  the number of nodes in the list.
*
          T = TABLE(100)
          COPYL = COPYL(L)
*
*  We are done!  Redefine COPYL to the original definition
*  and return.
*
          DEFINE('COPYL(L)T')                         :(RETURN)
*
*  Internal entry point: If L is not a link there is no need
*  to copy it.  Just return L.
*
COPYL_1   COPYL = L
          IDENT(DATATYPE(L),'LINK')                   :F(RETURN)
*
*  Have we ever copied this LINK before?  If we have, just
*  return the copied LINK.
*
          COPYL = T<L>
          DIFFER(COPYL,NULL)                          :S(RETURN)
*
*  Otherwise copy the LINK and indicate this fact in the
*  table.
*
          COPYL = COPY(L)
          T<L> = COPYL
*
*  Now copy the 2 fields.
*
          VALUE(COPYL) = COPYL(VALUE(L))
          NEXT(COPYL) = COPYL(NEXT(L))                :(RETURN)
COPYL_END 
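
The table technique carries over directly to other languages. The Python sketch below is an analogy under our own assumptions: a dict keyed by object identity replaces the SNOBOL4 table, and an optional second argument replaces the trick of redefining the function, so it is closer in spirit to the COPYL and COPYL_INT alternative mentioned above.

```python
class Link:
    def __init__(self, next=None, value=None):
        self.next = next
        self.value = value


def copyl(l, table=None):
    """Copy a configuration of Links, which may contain loops."""
    if table is None:
        table = {}                    # allocated once, on the outermost call
    if not isinstance(l, Link):
        return l                      # not a link: no need to copy it
    if l in table:
        return table[l]               # already copied: reuse that copy
    new = Link(l.next, l.value)       # shallow copy, like COPY(L)
    table[l] = new                    # record it before descending
    new.value = copyl(l.value, table)
    new.next = copyl(l.next, table)
    return new


a = Link(None, 'A')
b = Link(a, 'B')
a.next = b                            # a -> b -> a, a two-node loop
c = copyl(a)
```

Recording the copy in the table before copying the fields is what prevents endless recursion around a loop. Like the original, the sketch recurses along the NEXT field, so a very long list can exhaust the recursion stack.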

Program 5.9 - FLD

FLD(ST,I) will return (by name) the Ith field of the structure
ST, failing if I exceeds the number of fields in the structure
ST. It is written using 2 built-in functions, APPLY and FIELD.
APPLY may be used with arbitrary
function names as well as with fields of a structure. Note
that APPLY returns by name (where applicable) and also note
that FIELD requires a datatype, not a data object.


          DEFINE('FLD(ST,I)')                         :(FLD_END)
FLD       FLD = .APPLY(FIELD(DATATYPE(ST),I),ST)
+                                                     :S(NRETURN)F(FRETURN)
FLD_END 
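
Selecting a field by position rather than by name has a close counterpart in Python's dataclasses module. In this sketch, which is an analogy only, fld returns the name of the Ith field (or None to signal failure) and the caller uses getattr and setattr where SNOBOL4 would use the returned name directly.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass(eq=False)
class Link:
    next: Any = None
    value: Any = None


def fld(st, i):
    """Return the name of the i-th field of st (1-based), or None."""
    names = [f.name for f in fields(st)]
    if not 1 <= i <= len(names):
        return None                   # stands in for failure
    return names[i - 1]


l = Link(None, 22)
second = getattr(l, fld(l, 2))        # like FLD(L,2)
setattr(l, fld(l, 1), 'ABC')          # like FLD(L,1) = 'ABC'
```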

Program 5.10 - VISIT

VISIT will visit every structure of a configuration, once and
only once, calling PROCESS(ST) upon arrival, where ST is the
structure visited. PROCESS represents some
activity to be carried out and is left to be defined by the
user.


COPYL, in the process of copying a configuration, had to visit 
every node and we could let that function serve as a model 
from which to write VISIT. The only basic difference would be 
that, in COPYL, we knew the kind of structures we were dealing 
with and so we could reference the fields by name. In VISIT, 
the structures are arbitrary and so we must use a function 
such as FLD to sequence through every field.

But we will depart from the COPYL method in two other ways.
In the first place, we would like to present a method which 
avoids recursion. In many languages recursion is either 
unavailable or inefficient. Also, recursion, if carried to 
too many levels, will result in stack overflow. Also, we would 
like to present a method of marking structures which does not 
depend on tables. 


The algorithm, to be presented, was discovered independently 
in 1965 by Deutsch and by Schorr and Waite; see Knuth [Vol. 1,
pp. 416-417]. It was developed in connection with garbage col-
lection. One phase of garbage collection is the marking phase
when every structure which can be accessed is marked. Subse- 
quent phases insure that the marked structures are saved and 
the unmarked structures discarded. Avoiding recursion when 
garbage collecting is highly desirable if the recursion stack 
is sharing collectable storage. 


The algorithm works as follows. SON initially points to the 
root node of a tree as indicated in Figure 5.8(a), and the 
node is marked with a 1 (also shown in the figure). All poin- 
ters in the structure are examined to see if they point off to 
any as-yet-unmarked structure. If an unmarked structure is 
found, it is regarded as the new SON and the old son becomes 
the FATHER. If, in the new son, there is a pointer off to an 
unmarked node, the SON and FATHER descend another level. The 
pointer which had been used to point downward in the tree is 
redirected upward so that it is possible to determine from 
whence we came. The situation is depicted in Figure 5.8(b).
Note that FATHER and SON span a 'gap' in the structure created
by our backward pointer. This is similar to REVL. 


The backward pointers permit us to crawl back up the tree when 
we are through examining all the descendants of SON. The MARK 
serves also the purpose of denoting which field is being used 
as backward pointer. For example, Figure 5.8(c) shows the 
situation a little later in which a mark of 2 on the grand- 
father indicates that the 2nd field is pointing to the great- 
grandfather. 


When we are done, all the marks will have been set positive. 
We cannot make all the marks 0 again using our VISIT function 
but we can make them all negative by setting SIGN = -1. VISIT 
will work properly if the initial value of the marks is <= 0 so
that this procedure can be used to restore the state of the
configuration to one which will accept subsequent VISITs. 


We could use a table to record the marks, as we did with 
COPYL. However, a more efficient method would be to add a MARK 
field to each data structure. For example, to add a MARK field 
to the LINK data type we could execute 


          DATA('LINK(NEXT,VALUE,MARK)')


It is rather remarkable that we may substitute this DATA call
for the DATA call

          DATA('LINK(NEXT,VALUE)')


in just about any program without modifying its behaviour. But 
it is at least inelegant, and perhaps impractical, to request 
users of VISIT to add a MARK field to every structure. Hence
we will do this for him by redefining the DATA function. The
new data function will capture control of each call to DATA,
insert a MARK field, and then call the old original DATA
function.


If the user is using the FIELD function, as we do in FLD, he 
may inadvertently sequence into the MARK field which is sup- 
posed to be kept invisible. But we can keep him out of the 
MARK field by redefining the FIELD function. 


*
*  VISIT(ST) will visit every node of the configuration
*  headed by structure ST.  Visitation consists of calling
*  PROCESS(ND) where ND is the node.  VISIT(ST,-1) will reset
*  the marks.
*
          DEFINE('VISIT(SON,SIGN)FATHER,GS,GF,DT,I')
*
*  Redefine the DATA function so that a MARK field is
*  inserted into each new datatype.
*
          OPSYN('OLD_DATA','DATA')
          DEFINE('DATA(S)')                           :(DATA_END)
DATA      S ')' = ',MARK)'
          OLD_DATA(S)                                 :(RETURN)
DATA_END
*
*  Redefine the FIELD function so that the user won't know
*  about the MARK field.
*
          OPSYN('OLD_FIELD','FIELD')
          DEFINE('FIELD(DT,I)')                       :(FIELD_END)
FIELD     OLD_FIELD(DT,I + 1)                         :F(FRETURN)
          FIELD = OLD_FIELD(DT,I)                     :S(RETURN)F(FRETURN)
FIELD_END
*
*  Initialization section for VISIT: STND_DT will match a
*  standard datatype.
*
          STND_DT = POS(0) ('STRING' | 'INTEGER' | 'REAL'
+            | 'PATTERN' | 'ARRAY' | 'TABLE' | 'NAME' |
+            'EXPRESSION' | 'CODE' | 'EXTERNAL') RPOS(0)
+                                                     :(VISIT_END)


*
*  Entry point for VISIT: The default value for SIGN is 1.
*  If the datatype of the node is standard (i.e. not
*  programmer-defined), just return.
*
VISIT     SIGN = EQ(SIGN,0) 1
          DATATYPE(SON) STND_DT                       :S(RETURN)
*
*  Control flows to VISIT_2 whenever a previously unmarked
*  SON is found.  Here it is processed and marked and I is
*  initialized.
*
VISIT_2   PROCESS(SON)
          MARK(SON) = SIGN
          I = 0
*
*  Examine the Ith node of SON (GS means grandson).  If GS is
*  an unmarked structure, fall through.  Else, loop.  If no
*  more grandsons remain, go to VISIT_3.
*
VISIT_1   I = I + 1
          GS = FLD(SON,I)                             :F(VISIT_3)
          DATATYPE(GS) STND_DT                        :S(VISIT_1)
          GT(SIGN * MARK(GS),0)                       :S(VISIT_1)
*
*  Mark the SON with the current value of I so we can pick up
*  later where we left off.  Point back to FATHER rather than
*  forward to GS.
*
          MARK(SON) = SIGN * I
          FLD(SON,I) = FATHER
*
*  Descend down one level; then go back to PROCESS and MARK
*  the new SON.
*
          FATHER = SON
          SON = GS                                    :(VISIT_2)
*
*  Here if no grandsons are left.  If FATHER is null we are
*  done.  Otherwise set GF to be the grandfather.
*
VISIT_3   IDENT(FATHER)                               :S(RETURN)
          I = SIGN * MARK(FATHER)
          GF = FLD(FATHER,I)
*
*  Point back toward the SON.  Then hoist up one level.
*
          FLD(FATHER,I) = SON
          SON = FATHER
          FATHER = GF                                 :(VISIT_1)
VISIT_END
Names referenced by VISIT:

     Name     Type         Where defined
     FLD      Function     Program 5.9
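
The pointer-reversal walk can be tried out in a rough Python analogue. This is a sketch under stated assumptions and not a transcription of VISIT: every structure is assumed to be a dataclass carrying an integer field named mark, the names Node, is_struct and field_names are ours, and the PROCESS routine is passed in as an argument.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Any


@dataclass(eq=False)
class Node:
    left: Any = None
    right: Any = None
    label: str = ""
    mark: int = 0                     # plays the role of the MARK field


def is_struct(x):
    return is_dataclass(x) and not isinstance(x, type)


def field_names(st):
    # The mark field is kept out of sight, as the redefined FIELD does.
    return [f.name for f in fields(st) if f.name != "mark"]


def visit(son, process, sign=1):
    """Visit every reachable structure once, without recursion."""
    if not is_struct(son):
        return
    father = None
    fresh = True                      # True when son has just been found
    i = 0
    while True:
        if fresh:
            process(son)
            son.mark = sign
            i = 0
        names = field_names(son)
        grandson = None
        while i < len(names):         # look for an unmarked structure
            i += 1
            candidate = getattr(son, names[i - 1])
            if is_struct(candidate) and sign * candidate.mark <= 0:
                grandson = candidate
                break
        if grandson is not None:
            son.mark = sign * i                   # remember where we left off
            setattr(son, names[i - 1], father)    # point back up the tree
            father, son = son, grandson           # descend one level
            fresh = True
            continue
        if father is None:
            return                                # back at the root: done
        i = sign * father.mark
        up = field_names(father)[i - 1]
        grandfather = getattr(father, up)
        setattr(father, up, son)                  # restore the forward pointer
        son, father = father, grandfather         # hoist up one level
        fresh = False


a, b, c = Node(label="a"), Node(label="b"), Node(label="c")
a.left, a.right, b.left, c.right = b, c, c, a     # shared node and a loop
seen = []
visit(a, lambda n: seen.append(n.label))
```

On return every pointer has been restored and every mark is positive; calling visit again with sign=-1 turns the marks negative so that a later ordinary visit will work.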


Exercises for chapter 5


| Exercise 5.1 | Rewrite CRACK(S,C) (Prog. 4.1) to return a
linked list of strings rather than an array
of strings.


| Exercise 5.2 | A doubly-linked list is one in which, in
addition to a NEXT field pointing to the next
item on the list, there is a PREV field pointing to the
previous item on the list. Let L be a link of such a list.
Write code to remove the link from its list.


| Exercise 5.3 | Write a routine FIRST() which will remove
(and return) the first item on the push-down
stack maintained by PUSH and POP and fail if no such item
exists. Do this (a) without modifying PUSH and POP and (b)
modifying PUSH so that the process of getting the first ele-
ment is more efficient.


| Exercise 5.4 | Modify COPYL so that it copies a configuration
composed of structures of arbitrary
types.


| Exercise 5.5 | As indicated in the text, the assignment
LAST(L) = L will create a circular list.
What modification to REVL (Prog. 5.3) is required to reverse a
circular list (the node returned should be the node originally
given)?


| Exercise 5.6 | Write a routine DISPLAY(L) which will
display a data configuration headed by L.
The type of structures in the configuration may be dissimilar
and arbitrary.


| Exercise 5.7 | Write a function called IFFLD(N,S) which
will serve as a predicate to determine
whether N is the name of a field of the structure S. The body
of the function requires two statements.


| Exercise 5.8 | Modify DATA and FIELD (subfunctions of
VISIT, Prog. 5.10) so that every structure
created will have not one but two additional fields MARK and
THREAD. Moreover, arrange to seize control at each request to
allocate a new structure so that all structures will be
threaded together via the THREAD field. Rewrite VISIT so that
by chaining down the THREAD field, the MARK field of each
structure is initially set to 0.


| Exercise 5.9 | How would you modify VISIT (Prog. 5.10) in
order to copy an arbitrary configuration?
(Hint: Add a field called NEW to every structure which will
point to the copied version.)


| Exercise 5.10 | Two configurations are said to be isomorphic
if there is a one-one correspondence
between the structures of the configurations such that if two
structures correspond (a) they have the same type, (b) any
field of one structure that does not have a structure as value
must equal the corresponding field of the other, and (c) if a
field of one has a structure S as value then the field of the
other must have a structure S' such that S corresponds with
S'. Write a subroutine ISO(S1,S2) which will succeed if struc-
tures S1 and S2 correspond in an isomorphic configuration.

What is a pattern? We have used patterns throughout
the preceding sections of this book without consciously
evoking this question. Indeed it is perhaps
not strictly necessary to know what patterns are so
long as one knows how they work and what they do.
However, patterns play such an important role in SNOBOL4
programming and they provide such a powerful facility for
analyzing input data strings that a strong conceptual
framework becomes necessary in order to derive clean and ef-
ficient implementations, resolve complex and seemingly
ambiguous issues and contrive reasonable extensions.


It is tempting to suggest that a pattern is a set of strings. 
Thus 


          P = 'AB' | 'A'


would identify P as the two strings 'AB' and 'A'. Continuing 
in this vein 


P = LEN(3) 


would be the set of all strings consisting of three characters 
and 


          P = ARBNO(ANY('AB'))


would be the set of all strings (including the null string) 
comprised of characters chosen from the set {A,B}. FAIL, of 
course, would be the empty set. 


But what would we make of the patterns POS(n), RPOS(n),
TAB(n), RTAB(n), BREAK(s), SPAN(s), FENCE, and ABORT, which
cannot be uniquely identified with a set of strings? Thus
POS(n) matches the null string when it matches but it doesn't 
match all null strings, only those at position n. If we iden- 
tified POS(0) with the null string, we would be forced to 
conclude that POS(0) = POS(1) which is nonsense. By a similar 
token, BREAK(s), when it matches, will match a string not con- 
taining a character of s but it cannot be said to match all 
such strings, only those followed by a character of s. Hence, 
although BREAK(s) can match a null string on occasion, it can- 
not be related uniquely to the null string. The strings that 
BREAK(s) matches are determined in part by the context in 
which the strings are emkedded and this is true of most of the 
patterns which cannot be related to string sets. 


Another difference between patterns and sets of strings is 
that a pattern, if it matches more than one string, expresses 
a preference between any two. Thus 


'AB' | 'A'

implies that 'AB' is tried before 'A' and behaves differently
from

'A' | 'AB'
PATTERNS AND CURSORS

Patterns are more accurately thought of as recognition
processes operating on cursors. A cursor is a pair (S,I) where
S is a string called the subject and I is an integer marking a
position in the subject. I is called the cursor position. A
cursor points between characters (as opposed to at them) and
therefore the cursor position ranges between 0 and the length
of the subject inclusive. The cursor ('ABCDEF',2) is depicted
in Figure 6.1.


+---+ +---+ +---+ +---+ +---+ +---+
| A | | B | | C | | D | | E | | F |
+---+ +---+ +---+ +---+ +---+ +---+
           ^

Figure 6.1

A depiction of the cursor ('ABCDEF',2)


When a pattern is called upon to match, it is presented with a
cursor called the pre-cursor and the pattern either matches or
fails to match at that point. If it matches, there will be a
portion of the subject matched, bounded by the pre-cursor and
a resulting cursor called a post-cursor. A pattern P can then
be defined as a function whose input value is a cursor and
whose output value is a sequence of cursors. For reasons which
will become apparent later we will use backward notation (c)P
or simply cP to represent the application of the pattern P to
its cursor argument c. Hence we write

cP = [c1, c2, ...]


We will use square brackets as above to represent sequences, 
reserving braces to represent sets and parentheses for other 
kinds of scope delimitation. 


For example, if the pattern ('CDE' | 'C') is applied to the
cursor of Figure 6.1 we have

('ABCDEF',2) ('CDE' | 'C') = [5, 3]

In the above, the cursor position 5 stands as an abbreviation
for the cursor ('ABCDEF',5) and similarly 3 is an abbreviation
for ('ABCDEF',3). This represents no ambiguity since the
subject does not change during a match.

We will use Ø to represent the null sequence. Thus

('ABCDEF',1) ('CDE' | 'C') = Ø


Two patterns are equal if they represent the same function. 
That is, if (c)P1 = (c)P2 for all c then P1 = P2.
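The functional view of a pattern can be made concrete with a small model. The sketch below is written in Python rather than SNOBOL4, and the names lit and alt are hypothetical helpers introduced only for illustration. A pattern is modelled as a function taking a subject and a cursor position and returning the list of post-cursor positions; alternation is assumed to concatenate the two sequences, which is the meaning used in the example above.

```python
# A toy model of the cursor-sequence view of patterns (not SNOBOL4 itself).
# A pattern is a function (subject, pos) -> list of post-cursor positions.

def lit(text):
    """Pattern matching the literal string `text` at the cursor."""
    def match(subject, pos):
        return [pos + len(text)] if subject.startswith(text, pos) else []
    return match

def alt(*patterns):
    """Alternation: the post-cursor sequences are joined in order."""
    def match(subject, pos):
        result = []
        for p in patterns:
            result.extend(p(subject, pos))
        return result
    return match

p = alt(lit('CDE'), lit('C'))
print(p('ABCDEF', 2))   # [5, 3]
print(p('ABCDEF', 1))   # []  (the null sequence)
```

Two such functions model equal patterns only if they agree on every cursor of every subject, so a finite check of this kind can refute an equality but cannot establish one.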


Below are some examples of built-in patterns in SNOBOL4. L is
the length of the subject string. When a cursor is used in an
arithmetic context it is the cursor position that is implied.
For simplicity, the sequence [c] is represented as simply c.

c POS(n)   = c      if n = c
           = Ø      otherwise
c RPOS(n)  = c      if n = L-c
           = Ø      otherwise
c TAB(n)   = n      if n ≥ c
           = Ø      otherwise
c RTAB(n)  = L-n    if L ≥ L-n ≥ c
           = Ø      otherwise
c LEN(n)   = c+n    if c+n ≤ L
           = Ø      otherwise
('ABCDEF',1) BREAK('TAF')  = [5]
('ABCDEF',2) SPAN('CAT')   = [3]
('A(B())CD',0) BAL         = [1, 6, 7, 8]
('ABCDE',0) ARB            = [0, 1, 2, 3, 4, 5]


Note that in the above, most built-in patterns have at most
one post-cursor position. ARB and BAL are exceptions and these
are regarded as having 'implicit alternatives'.
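The table above translates almost line for line into the functional model. The following Python sketch is an illustration under assumptions, not a description of any SNOBOL4 implementation: the function names are hypothetical, the upper bound on TAB is added here and is not stated in the table, and BAL is modelled only for parentheses, yielding every nonnull balanced prefix in order of increasing length.

```python
# Built-in patterns modelled as functions (subject, cursor) -> list of post-cursors.

def pos(n):
    return lambda s, c: [c] if c == n else []

def rpos(n):
    return lambda s, c: [c] if len(s) - c == n else []

def tab(n):
    # The bound n <= len(s) is an added assumption; the table only states n >= c.
    return lambda s, c: [n] if c <= n <= len(s) else []

def rtab(n):
    return lambda s, c: [len(s) - n] if c <= len(s) - n <= len(s) else []

def length(n):
    return lambda s, c: [c + n] if c + n <= len(s) else []

def break_(chars):
    def match(s, c):
        i = c
        while i < len(s) and s[i] not in chars:
            i += 1
        return [i] if i < len(s) else []   # must stop in front of a break character
    return match

def span(chars):
    def match(s, c):
        i = c
        while i < len(s) and s[i] in chars:
            i += 1
        return [i] if i > c else []        # must match at least one character
    return match

def arb(s, c):
    return list(range(c, len(s) + 1))

def bal(s, c):
    out, depth = [], 0
    for i in range(c, len(s)):
        if s[i] == '(':
            depth += 1
        elif s[i] == ')':
            depth -= 1
            if depth < 0:
                break
        if depth == 0:
            out.append(i + 1)
    return out

print(break_('TAF')('ABCDEF', 1))   # [5]
print(span('CAT')('ABCDEF', 2))     # [3]
print(bal('A(B())CD', 0))           # [1, 6, 7, 8]
print(arb('ABCDE', 0))              # [0, 1, 2, 3, 4, 5]
```

Only arb and bal ever return more than one position, which is the sense in which they carry implicit alternatives.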


Unevaluated expressions within patterns may make their
behavior vary during a match. Thus

P = BREAK(*S)

will succeed or fail depending on the value of S. Any such
pattern is termed varying. For the duration of this chapter
we will only be concerned with nonvarying patterns.

Pattern alternation is defined as

c(P1 | P2) = (cP1) (cP2)    (6.1)

where the right hand side indicates the concatenation of the
two sequences.
To define the concatenation of patterns we must extend the
definition of pattern to operate on sequences of cursor
positions. This is easily done:

[c1, c2, ...] P = (c1 P) (c2 P) ...    (6.2)

Note that the notation c1Pc2P is ambiguous because it can mean
either ((c1P)c2)P or (c1P)(c2P) and so will be avoided. For
completeness

Ø P = Ø


Pattern concatenation is defined as 


c(P1 P2) = (cP1) P2    (6.3)

For example

('ABCDEF',2) (('CDE' | 'C') LEN(1))

= [5, 3] LEN(1)

= [6, 4]

The pattern FAIL is defined as:

(c)FAIL = Ø
for all c. Hence 
FAIL | P = P = P | FAIL 


for all P. That is, FAIL is the identity element under pattern 
alternation. Note that 


(c)NULL = c 
where NULL is the null string. This is the identity mapping 
for cursors and hence NULL is the identity element for pattern 
concatenation. That is 
NULL P = P = P NULL 
for all patterns P. 
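Definitions (6.1) through (6.3) and the two identity elements can be checked mechanically in a small model. The Python sketch below is illustrative only; apply_seq, alt, cat and the other names are hypothetical, and the check covers a single subject, so it demonstrates the identities rather than proving them.

```python
# Patterns as functions (subject, cursor) -> list of post-cursor positions.

def lit(text):
    return lambda s, c: [c + len(text)] if s.startswith(text, c) else []

def length(n):
    return lambda s, c: [c + n] if c + n <= len(s) else []

def alt(p1, p2):
    # (6.1): c(P1 | P2) = (cP1)(cP2)
    return lambda s, c: p1(s, c) + p2(s, c)

def apply_seq(seq, p, s):
    # (6.2): [c1, c2, ...] P = (c1 P)(c2 P) ...
    out = []
    for c in seq:
        out.extend(p(s, c))
    return out

def cat(p1, p2):
    # (6.3): c(P1 P2) = (cP1) P2
    return lambda s, c: apply_seq(p1(s, c), p2, s)

fail = lambda s, c: []      # (c)FAIL = null sequence
null = lambda s, c: [c]     # (c)NULL = c

def same(p, q, subject):
    """True if p and q agree at every cursor of the given subject."""
    return all(p(subject, c) == q(subject, c) for c in range(len(subject) + 1))

p = alt(lit('CDE'), lit('C'))
print(cat(p, length(1))('ABCDEF', 2))                           # [6, 4]
print(same(alt(fail, p), p, 'ABCDEF'), same(alt(p, fail), p, 'ABCDEF'))
print(same(cat(null, p), p, 'ABCDEF'), same(cat(p, null), p, 'ABCDEF'))
```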


A pattern may have a countably infinite number of post-cursor 
positions. For example: 


(c)SUCCEED = [c, c, c, ...]


where the sequence goes on indefinitely. An infinitude of
alternates, therefore, produces a well-defined pattern. Thus

ARB = (NULL | LEN(1) | LEN(2) | ... )

may be regarded as a proper definition for ARB. Whereas the
number of post-cursor positions of (c)ARB is bounded by the
length of the subject and so is always finite, its finiteness
is not in general a requirement that the pattern be
well-defined. A pattern whose sequence of post-cursors is
finite for all pre-cursors is said to be finite. If there is
at least one pre-cursor such that the list of post-cursors is
infinite the pattern is said to be infinite. As usual, we will
hold that if C is infinite then

C C' = C

for all sequences C'. Thus

SUCCEED = SUCCEED | P

for all patterns P.

It should not be thought here that the definition of pattern
is to be restricted in any way to those patterns which are
directly available via SNOBOL4 primitives or by combinations
of simple operations such as alternation or concatenation. A
pattern is any well-defined process which maps a cursor into
cursors of the same subject.


NONLINEAR PATTERNS

ABORT is a more pungent form of FAIL. Whereas (c)ABORT, like
(c)FAIL, contains no post-cursor positions (ABORT always
fails) ABORT differs from FAIL in that it causes an immediate
halt of scanning. To include ABORT in the theory it is
necessary to annex a new element which is the value of ABORT.
We write

(c)ABORT = †

† is called the abort symbol. When it is concatenated on the
left of any sequence of cursors it yields itself. That is

† [c1, c2, ...] = †

An extended sequence E has the form

E = C λ = [c1, c2, ...] λ

where C is a sequence of cursor positions, possibly infinite,
possibly null, and λ is either † or Ø. Concatenation of
extended sequences is defined as

(C1 λ1) (C2 λ2) = C1 C2 λ2    if λ1 = Ø
                = C1 λ1       if λ1 = †

It is easy to see that the concatenation of extended sequences
is associative (the leftmost abort symbol is the important
one no matter how the sequences are grouped) so that

(E1 E2) E3 = E1 (E2 E3)    (6.4)

We can extend the domain of patterns from mere sequences to
extended sequences as follows:

(C λ) P = (C P) λ    (6.5)

Note that (†)P = †


An extended sequence which does not have a terminal abort sym- 
bol is called linear; otherwise it is called nonlinear. If 
for all cursors c, the value of (c)P is linear then P itself 
is said to be linear. 


The built-in pattern FENCE which matches the null string but
causes an immediate halt of scanning (like ABORT) when backed
into is defined as

(c)FENCE = [c] †
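Extended sequences are easy to model once the abort symbol is represented explicitly. In the Python sketch below, an extended sequence is assumed to be a pair (cursors, aborted), where the boolean plays the role of the terminal abort symbol; econcat, apply and the pattern names are hypothetical and serve only to illustrate the definitions of this section.

```python
# An extended sequence is modelled as (list_of_cursors, aborted_flag).

def econcat(e1, e2):
    """Concatenation of extended sequences: the leftmost abort symbol wins."""
    cursors1, aborted1 = e1
    if aborted1:
        return (cursors1, True)
    cursors2, aborted2 = e2
    return (cursors1 + cursors2, aborted2)

def apply(e, p, s):
    """(6.5): apply pattern p to every cursor of e, keeping e's terminal symbol."""
    cursors, aborted = e
    out = ([], False)
    for c in cursors:
        out = econcat(out, p(s, c))
    return econcat(out, ([], aborted))

null  = lambda s, c: ([c], False)
fail  = lambda s, c: ([], False)
abort = lambda s, c: ([], True)
fence = lambda s, c: ([c], True)

def alt(p1, p2):
    return lambda s, c: econcat(p1(s, c), p2(s, c))

def cat(p1, p2):
    return lambda s, c: apply(p1(s, c), p2, s)

e1, e2, e3 = ([1], False), ([2], True), ([3], False)
print(econcat(econcat(e1, e2), e3) == econcat(e1, econcat(e2, e3)))  # True
print(apply(([], True), null, 'X'))     # ([], True): an abort applied to P stays an abort
print(alt(null, abort)('X', 0))         # ([0], True), the same value as (c)FENCE
print(cat(fence, fail)('X', 0))         # ([], True): backing into FENCE halts the scan
```

The third line anticipates the identity FENCE = NULL | ABORT, which the chapter uses further on.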
FUNDAMENTAL PROPERTIES

The definitions of concatenation and alternation of patterns
given above, (6.1) and (6.3), are still valid with extended
sequences. It follows immediately from the associativity of
extended sequences that the alternation of patterns is
associative. That is

(P1 | P2) | P3 = P1 | (P2 | P3)    (6.6)
We briefly introduced the notions of transformations and
homomorphisms on strings in Chapter 3. It readily follows from
(6.2) and (6.5) that patterns are homomorphic transformations
on extended sequences. That is

(E1 E2) P = (E1 P) (E2 P)    (6.7)

From this it follows that

E (P1 P2) = (E P1) P2    (6.8)

Thus, if a pattern is regarded as a transformation on extended
sequences, concatenation becomes function composition. It is
an interesting fact that function composition is always
associative. Thus

(P1 P2) P3 = P1 (P2 P3)    (6.9)


Proposition  Concatenation distributes over alternation from
the right. That is

(P1 | P2) P3 = P1 P3 | P2 P3    (6.10)

Proof: The left hand side when applied to a cursor c will
produce by (6.1) and (6.7) and (6.1) again

((cP1) (cP2)) P3

= (cP1P3) (cP2P3) = c(P1P3 | P2P3)

Note that distribution from the left would depend upon
E(P1 | P2) = (EP1) (EP2) which is not true for arbitrary E.
See Exercise 6.2.


A pattern P is said to be monic if (c)P has at most one
post-cursor. Thus 'A' | 'AB' is not monic but 'A' | 'B' is
monic since both alternands could not match at the same
pre-cursor position. Also, FENCE is monic for although
(c)FENCE is c† the abort symbol does not count as a
post-cursor position. Note that if M1 and M2 are monic
patterns then so is their concatenation (M1 M2).

Proposition  If m is monic and linear then it distributes over
alternation from the left. That is

m (P1 | P2) = mP1 | mP2    (6.11)

The proof of this is simple and will be left as an exercise.


Most of SNOBOL4's built-in patterns are, as has been
previously noted, monic. The others are referred to as having
implicit alternatives. If a pattern is composed only of monics
then it can be decomposed into an alternation of monics as in
the proposition below. This yields a kind of canonical form
for patterns.

Proposition  Let P be any pattern formed by concatenation and
alternation of linear monic patterns and ABORT and FENCE. Then
P can be written

m1 A1 | m2 A2 | ... | mn An    (6.12)

where each mi is linear monic and where each Ai is either
ABORT or NULL (the null string also serves as the null pattern
and both differ from the null sequence, Ø).

Proof: By induction. If P has only one element then, since

FENCE = NULL | ABORT

P is of the indicated form. If P is of the form P1 | P2 and
both P1 and P2 are in the form of (6.12), P is also. If P is
of the form P1 P2 and both are of the form (6.12) we have, by
right distribution

P1 P2 = m1 A1 P2 | ... | mn An P2

Focus on only one term, for if we can show that each term
reduces to (6.12), their alternation will. Consider

m A P2

If A is ABORT, the value is m A and is of the desired form.
Otherwise apply left distribution of m over P2.


SCANNING

In the normal unanchored mode of scanning the cursor first
presented to the pattern is (Subject,0) and upon failure the
pattern is presented with (Subject,1) and so forth until the
pattern succeeds. That is, the effect of a pattern match is
the first cursor position of

(0 P) (1 P) ... (L P)

if any. Here L is the length of the subject. The string
matched is determined by the first nonempty (c P). Let (c1 P)
be the first nonempty one. Let c2 be the first post-cursor of
(c1 P). Then the string bounded by c1, c2 is the substring
matched. For example, let the subject be 'ABC' and let the
pattern be 'AB' | 'C'. Then the sequence

(0 P) (1 P) (2 P) (3 P)

is

[2] Ø [3] Ø = [2, 3]

The first pre-cursor position (0) and the first post-cursor
position (2) determine the string matched ('AB').


If the pattern matcher is in anchored mode then the sequence 
of cursor positions of interest is only (0 P). 
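The scanning rule just described can be sketched in a few lines. The Python below is a model under assumptions: patterns are linear functions from (subject, cursor) to a list of post-cursors, and scan is a hypothetical name, not a SNOBOL4 facility.

```python
def lit(text):
    return lambda s, c: [c + len(text)] if s.startswith(text, c) else []

def alt(p1, p2):
    return lambda s, c: p1(s, c) + p2(s, c)

def scan(subject, pattern, anchored=False):
    """Return (pre-cursor, post-cursor) of the match, or None on failure."""
    last_start = 0 if anchored else len(subject)
    for start in range(last_start + 1):
        posts = pattern(subject, start)
        if posts:                        # first nonempty (c P)
            return start, posts[0]       # its first post-cursor ends the match
    return None

p = alt(lit('AB'), lit('C'))
print([q for c in range(4) for q in p('ABC', c)])   # [2, 3]
start, end = scan('ABC', p)
print('ABC'[start:end])                             # AB
print(scan('XC', p), scan('XC', p, anchored=True))  # (1, 2) None
```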


ARBNO

The function ARBNO(P) which may also be written P* is defined
as

P* = NULL | P P*    (6.13)

Since P* is defined in terms of itself we may well ask, is it
well-defined? That is, does (6.13) specify one and only one
pattern? The answer, as we will see, is yes, but the question
is at least as intriguing as the answer. Will a pattern, in
general, defined in terms of itself have a unique solution?
The answer is, obviously, no since

P = P

will be satisfied by any pattern. Next, we might consider
patterns having the same general form as (6.13), viz.

P = Q1 | Q2 P    (6.14)

Will this always uniquely define P where Q1 and Q2 are given?
The answer is no, for let Q1 = FAIL and let Q2 = NULL. Then
(6.14) reduces to

P = FAIL | NULL P = NULL P = P
Here, as before, there are an infinite number of solutions to
the equation. As a less trivial example, let

Q1 = POS(0)
Q2 = POS(1)

Then (6.14) has an infinitude of solutions of the form:

P = POS(0) | POS(1) P'

where P' is any pattern. (Note that POS(i) POS(j) is either
FAIL if the arguments are unequal or POS(i) if i = j.)


For the special case that Q1 is NULL, however, we have the
following


Proposition  For any pattern Q the equation

P = NULL | Q P    (6.15)

can be satisfied by one and only one pattern P.

Proof: We will prove this by providing a procedure for
computing the kth cursor position (if one exists) of (c)P for
all c and for all k. Since (c)NULL = c, the first cursor
position of (c)P is determinable for all c, viz. c itself.
This forms the basis of an inductive proof. Suppose that we
can compute the first k-1 cursor positions of (c)P for all c.
In some cases there may not be as many as k-1 in which case we
would know all of them and also how the sequence terminated
(i.e. with an abort symbol or not). Then to compute the kth
cursor position of (c)P we note that

(c)P = c (c Q P)

Letting (c)Q = [c1, c2, ...] λ we have

(c)P = c (c1 P) (c2 P) ... λ

Now all that is needed to compute the kth cursor of (c)P is
to compute the (k-1)st cursor of (c1)P if it exists. If it
does not and if the sequence is not terminated by an abort
symbol, we reduce k-1 by the number of cursor positions in
(c1)P and find the required cursor position of (c2)P. In this
way the sequence (c)P can be effectively computed for all k.
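The procedure in the proof amounts to producing the cursors of (c)P one at a time, on demand. A lazy generator is one way to sketch this. The Python below is a model restricted to linear patterns (the abort symbol is ignored), with hypothetical names, and is not a description of how any SNOBOL4 system implements ARBNO.

```python
from itertools import islice

# Patterns here are generator functions: (subject, cursor) -> iterator of post-cursors.

def any_of(chars):
    def match(s, c):
        if c < len(s) and s[c] in chars:
            yield c + 1
    return match

def null(s, c):
    yield c

def arbno(q):
    """The solution of P = NULL | Q P, produced lazily, one cursor at a time."""
    def match(s, c):
        yield c                       # (c)NULL = c is always the first cursor
        for c1 in q(s, c):            # then (c1 P)(c2 P) ... for (c)Q = [c1, c2, ...]
            yield from match(s, c1)
    return match

print(list(arbno(any_of('AB'))('ABBAC', 0)))      # [0, 1, 2, 3, 4]
print(list(islice(arbno(null)('X', 0), 4)))       # [0, 0, 0, 0]
```

The second line shows that the sequence is still well-defined when Q matches the null string: it is infinite, yet any particular cursor position in it can be computed.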


If the argument to ARBNO is monic and if ARBNO is anchored a 
kind of backup-free scanning results which can be useful for 
selectively scanning over portions of a string. For example, 


Q = "'"


S POS(0) ARBNO(Q BREAK(Q) Q | NOTANY(Q)) P 


will scan S for a substring not contained in quotes which will 
match the pattern P. 


A reasonable exercise at this point is to demonstrate that P
is applied at all pre-cursors not within quotes. First note
that the argument to ARBNO is monic and linear. Next we need
a

Proposition  Let m be linear monic. Then

ARBNO(m) = NULL | m | m² | m³ | ...    (6.16)

where m² is m concatenated with m, m³ = m² m, etc.

Proof:

ARBNO(m) = m*
         = NULL | m m*
         = NULL | m (NULL | m m*)
         = NULL | m | m² m*        by (6.11)

By induction it can be shown that the ith term is m to the
(i-1)st power.

Given (6.16) it should be evident that the sequence of
pre-cursors applied to P is monotonically increasing and that
P is applied at all points other than within quotes.
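The claim can be observed directly in the linear model by listing the post-cursors of the ARBNO, since these are exactly the pre-cursors handed to P. The Python sketch below uses hypothetical helper names and assumes that the argument of arbno never matches the null string, so that the eager recursion terminates.

```python
def lit(text):
    return lambda s, c: [c + len(text)] if s.startswith(text, c) else []

def notany(chars):
    return lambda s, c: [c + 1] if c < len(s) and s[c] not in chars else []

def break_(chars):
    def match(s, c):
        i = c
        while i < len(s) and s[i] not in chars:
            i += 1
        return [i] if i < len(s) else []
    return match

def alt(p1, p2):
    return lambda s, c: p1(s, c) + p2(s, c)

def cat(*ps):
    def match(s, c):
        seq = [c]
        for p in ps:
            seq = [q for c1 in seq for q in p(s, c1)]
        return seq
    return match

def arbno(p):
    # Assumes p never matches the null string; otherwise this recursion would not end.
    def match(s, c):
        out = [c]
        for c1 in p(s, c):
            out.extend(match(s, c1))
        return out
    return match

Q = "'"
skip = arbno(alt(cat(lit(Q), break_(Q), lit(Q)), notany(Q)))
print(skip("AB'CD'E", 0))    # [0, 1, 2, 6, 7]: positions 3, 4, 5 lie inside the quotes
print(skip("A'B", 0))        # [0, 1]: nothing is tried beyond an unclosed quote
```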


As another example, PL/I comments are delimited by /* on the
left and */ on the right. To match pattern P against a string
not contained in a comment we can execute:

S POS(0) ARBNO('/*' FENCE ARB '*/' FENCE | LEN(1)) P    (6.17)

Even the most ardent SNOBOL4 enthusiast will admit to being
puzzled occasionally over the use of FENCE. Its double
application in this example virtually begs for analysis. First
note that any pattern of the form P FENCE | M is monic for all
patterns P and all monic patterns M. Hence the argument to
ARBNO is monic. For any pattern P we have

(c)P = C λ

The associated linear pattern, PL, sometimes called the linear
part of P, is defined as

(c)PL = C

The associated nonlinear pattern, PN, sometimes called the
nonlinear part of P, is defined as

(c)PN = c λ

For example, the linear part of (ANY('AB') FENCE) is ANY('AB')
and, in general, the linear part of (m FENCE) for any linear
monic m is m itself. The nonlinear part is NULL | m ABORT.
The linear part of a monic pattern is monic. For example, the
linear part of ('/*' | LEN(1)) FENCE is the monic pattern that
matches '/*' if present or a single character if '/*' is not
present. Note that

(c)(PN PL) = (c λ) PL = (c PL) λ

           = C λ

and hence for all patterns P

PN PL = P    (6.18)


Note too that if PN is the associated nonlinear part of some
pattern then

FENCE PN = FENCE = PN FENCE    (6.19)

From (6.19) and (6.18) and associativity it follows that

FENCE P = FENCE PL    (6.20)

for all patterns P. In what follows, let

F = FENCE
N = NULL
A = ABORT

As stated previously

F = N | A    (6.21)

For all patterns P, using (6.21) and right distribution

F P = P | A    (6.22)

For all P

P A | A = A    (6.23)

If M is monic, it may easily be shown using (6.23) and (6.21)
and right distribution that

F M F = F M    (6.24)


Proposition  If M is monic and if m is the linear part of M
then

F M* = (F M)* = F (M F)* = F m*    (6.25)

Proof: To prove the first equality, by (6.22), (6.13), (6.22),
and (6.24)

F M* = M* | A
     = N | M M* | A
     = N | F M M*
     = N | F M F M*

The last equation has the general form

P = N | F M P

Since (F M)* also satisfies this equation we have by (6.15)

F M* = (F M)*

To prove the second equality, let M1 = M F. M1 is clearly
monic. By the first equality

F M1* = (F M1)*

Replacing M1 by M F and then using (6.24) we have

F (M F)* = (F M F)* = (F M)*

To prove the third equality, use the fact that F M = F m (see
(6.20)) and the first equality to obtain

(F M)* = (F m)* = F m*


Let us return to our example of searching for a semi-colon not
within comment delimiters. The pattern

POS(0) ARBNO('/*' FENCE ARB '*/' FENCE | LEN(1)) P

is of the form POS(0) ARBNO(M) P where M is monic. This
follows from the fact that any pattern of the form P FENCE | M
is monic. Anchoring on the left with POS(0) is equivalent to
anchoring on the left with FENCE from the standpoint of global
scanning. By (6.25)

FENCE ARBNO(M) P = FENCE ARBNO(ML) P

                 = FENCE (NULL | ML | (ML)² | ... ) P


where ML is the linear part of M. We need only show that ML 
behaves properly. From its definition there are only 3 cases 
to consider at any given cursor position. 


1) The string '/*' appears at the cursor position and there 
follows a '*/' in the string. In this case the entire comment 
is matched by ML. 


2) The string '/*' appears but no following '*/' is present. 
In this case ML fails. 


3) The string '/*' does not appear at the cursor in which case 
a single character is matched. 


From this it should be clear that P is applied to all cursors 
in the order of increasing cursor position except within  com- 
ments or unclosed comment constructions. 
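The three cases can be traced in a model that carries the abort symbol along. In the Python sketch below an extended sequence is assumed to be a pair (cursors, aborted); all names are hypothetical, and arbno assumes its argument never matches the null string. The cursors listed for ARBNO(M) are the pre-cursors at which P would be tried.

```python
# Extended sequences are modelled as (list_of_cursors, aborted_flag).

def econcat(e1, e2):
    if e1[1]:                                  # leftmost abort symbol wins
        return e1
    return (e1[0] + e2[0], e2[1])

def apply(e, p, s):
    out = ([], False)
    for c in e[0]:
        out = econcat(out, p(s, c))
    return econcat(out, ([], e[1]))

def lit(text):
    return lambda s, c: ([c + len(text)], False) if s.startswith(text, c) else ([], False)

def length(n):
    return lambda s, c: ([c + n], False) if c + n <= len(s) else ([], False)

def arb(s, c):
    return (list(range(c, len(s) + 1)), False)

def fence(s, c):
    return ([c], True)

def alt(p1, p2):
    return lambda s, c: econcat(p1(s, c), p2(s, c))

def cat(*ps):
    def match(s, c):
        e = ([c], False)
        for p in ps:
            e = apply(e, p, s)
        return e
    return match

def arbno(p):
    # P* = NULL | P P*; assumes p never matches the null string.
    def match(s, c):
        return econcat(([c], False), apply(p(s, c), match, s))
    return match

M = alt(cat(lit('/*'), fence, arb, lit('*/'), fence), length(1))
print(arbno(M)('A/*B*/C;', 0))   # ([0, 1, 6, 7, 8], True): cursors 2..5 are skipped
print(arbno(M)('A/*B', 0))       # ([0, 1], True): an unclosed comment stops the scan
```

In both results the terminal flag is set, which reflects the backup-free behaviour: once the alternatives of a FENCE-guarded step are exhausted the whole match is abandoned rather than retried.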


RECURSIVE PATTERNS

A pattern P which is defined in terms of itself is said to be
defined recursively. In the investigation of ARBNO, we have
encountered the definition P = Q1 | Q2 P where Q1 and Q2 were
given. Even in this simple case there were values for Q1 and
Q2 which would lead to an improper definition for P even
though the specific case of ARBNO led in all cases to a valid
definition. The general case of recursive definition is of
interest to the SNOBOL4 programmer because the language
permits, via unevaluated expressions, arbitrarily constructed
recursive definitions. For example, the SNOBOL4 assignment

P = NULL | 'A' *P

assigns to P a pattern which will satisfy the equation

P = NULL | 'A' P

From Prop. (6.15) we know that P is well-defined and has a
value according to (6.13) of ARBNO('A').


More generally, if P is assigned the value f(*P), where f is 
some functional form, then the pattern so defined is the one 
which satisfies the equation 


P = f(P) 


It may be that no pattern or more than one pattern satisfies
the equation in which case P is not well-defined. The scanner
typically loops for not well-defined cases. In SNOBOL4 it is
quite easy to write a recursive definition which has more than
one solution. For example:

P = *P

has an infinite number of solutions. It is not quite so easy
to find a recursive definition such that there is no solution
to P. To do so we make up a primitive pattern function called
NOT, defined as:

(c)NOT(P) = c    if (c)P = Ø
          = Ø    otherwise

There surely is no solution to the equation

P = NOT(P)

and hence the assignment P = NOT(*P) would lead to an
ill-defined construct. NOT, however, is not a primitive
facility of SNOBOL4 and, moreover, it is not known whether a
recursive definition can be written in SNOBOL4 which does not
have at least one solution.
There are many ways in which a recursive definition can be
poorly formed in SNOBOL4 and these usually result in having
more than one possible solution. Frequently the following
principle is violated.

Proposition  Let A, B, C and D be patterns. If B does not
match the null string or a string of negative length then

P = A | B P C | D    (6.26)

has at most one solution for P.

Proof: Let P1 and P2 be different solutions to (6.26). Let S
be a string which is matched differently by P1 and P2. Let c
be the cursor in S with the largest cursor position such that
(c)P1 ≠ (c)P2. Then

(cA) (cBP1C) (cD) ≠ (cA) (cBP2C) (cD)

(cBP1C) ≠ (cBP2C)

(cBP1) ≠ (cBP2)

Then for some c' in the sequence (cB) we must have

(c'P1) ≠ (c'P2)

But by definition of B, c' is greater than c which contradicts
the assumption that c was greatest.


(6.26) can be strengthened a great deal (see Exer. 6.20) but
this simple statement is quite powerful. For example, let

P = 'B' | 'A' P    (6.27)

Then by (6.26), P is unique. Now

ARBNO('A') 'B' = (NULL | 'A' ARBNO('A')) 'B'

               = 'B' | 'A' (ARBNO('A') 'B')

This last equation is in the form (6.27) so that

P = ARBNO('A') 'B'

is the unique solution for P.
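The agreement between the recursive definition (6.27) and ARBNO('A') 'B' can be spot-checked in the linear model. In the Python sketch below the recursive pattern looks itself up at match time, which is loosely analogous to an unevaluated expression; the helper names are hypothetical, and the comparison over a handful of subjects illustrates the result without proving it.

```python
def lit(text):
    return lambda s, c: [c + len(text)] if s.startswith(text, c) else []

def alt(p1, p2):
    return lambda s, c: p1(s, c) + p2(s, c)

def cat(p1, p2):
    return lambda s, c: [q for c1 in p1(s, c) for q in p2(s, c1)]

def arbno(p):
    # Assumes p never matches the null string.
    def match(s, c):
        out = [c]
        for c1 in p(s, c):
            out.extend(match(s, c1))
        return out
    return match

def recursive_p(s, c):
    # P = 'B' | 'A' P; the recursion ends because 'A' always advances the cursor.
    return alt(lit('B'), cat(lit('A'), recursive_p))(s, c)

closed_form = cat(arbno(lit('A')), lit('B'))

subjects = ['AAB', 'AABAB', 'B', 'AAA', '']
agree = all(recursive_p(s, c) == closed_form(s, c)
            for s in subjects for c in range(len(s) + 1))
print(recursive_p('AAB', 0), agree)    # [3] True
```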
If P is given as

P = A | B P

where B can match the null string we can frequently formulate
a set of solutions for P which satisfy the equation. First we
define IF(P) as:

IF(P) = NOT(NOT(P))    (6.28)

Then note that from the definition of NOT

NULL = NOT(P) | IF(P)    (6.29)

for all patterns P. It follows that for arbitrary patterns P
and Q:

P = IF(Q) P | NOT(Q) P    (6.30)

In this way we can decompose P into a number of disjoint
alternatives from which we may analyze the behavior of P. Note
from this last equation, since NOT(P) P = Ø, we have

P = IF(P) P    (6.31)

For example, let P be 'defined' recursively as:

P = LEN(1) | POS(0) P    (6.32)


By considering various disjoint situations we can reason out a
behaviour pattern for P as follows:

(c)P = [1, 1, ...]   if POS(0) LEN(1) would succeed
(c)P = c+1           if NOT(POS(0)) LEN(1) would succeed
(c)P = ?             if POS(0) NOT(LEN(1)) would succeed
(c)P = Ø             if NOT(POS(0)) NOT(LEN(1)) would succeed

The question mark (?) indicates that at this set of conditions
the equation merely says that P = P and so any pattern would
do. Letting X indicate such an arbitrary pattern we have

P = POS(0) LEN(1) SUCCEED | NOT(POS(0)) LEN(1) |
    POS(0) NOT(LEN(1)) X    (6.33)

We will let the reader confirm that any pattern of the form
(6.33) is a solution to (6.32) noting that NULL | SUCCEED =
SUCCEED, that P1 | P2 = P2 | P1 if P1 is mutually exclusive
with P2 and that POS(n) NOT(POS(n)) = FAIL.


Patterns exhibiting left recursion present ambiguous
conditions which are resolved when the scanner is in a mode
known as QUICKSCAN (the default mode). Consider

P = P 'A' | 'B'    (6.34)

This equation has a solution P = ABORT. As we will see,
however, in QUICKSCAN mode the pattern

P = *P 'A' | 'B'    (6.35)

operates as if it were defined as

P = 'BAA...' | ... | 'BAA' | 'BA' | 'B'

where this indicates that P matches any substring equal to a
'B' followed by an arbitrary number of 'A's matching
alternates in the order of decreasing length. The reader may
easily confirm that this value for P also satisfies (6.34).
This is implemented roughly as follows. When *P is called upon
to match in (6.35) the subject is reduced (on the right) by
the minimum number of characters required by *P's subsequent
(1 character in this case). Hence recursive plunges are taken
until no more characters remain which breaks the loop. Some
of the details of this process are described in the next
chapter. To establish the theoretical background for
understanding this heuristic, first note that if A does not
match the null string or a string of negative length, then for
any finite sequence C

C A = C  =>  C = Ø    (6.36)

This is easily seen by considering the smallest cursor
position in C; an immediate contradiction results.


Proposition  If A does not match the null string or a string
of negative length and if both A and B are finite linear
patterns then

P = P A | B    (6.37)

has exactly one finite linear solution for P, viz.

P = ... | B A³ | B A² | B A | B    (6.38)

Proof: We first note that (6.38) is well-defined if A must
match a nonzero length string since we can discard all
alternates other than the last L where L is the length of the
subject. Using (6.37) we obtain

cP = (cPA) (cB)    (6.39)

If (cB) = Ø then, by (6.36), (cP) = Ø. Since (cB) is finite
linear it may, by Exer. 6.6, be removed from both sides of
(6.39). Letting C1 be the result of this removal from cP we
have

C1 = cPA = (C1 (cB)) A = (C1 A) (cBA)

Again, by (6.36), if cBA = Ø we have that C1 = Ø. Otherwise
we may remove cBA from both sides. Assume that C2 is what
remains after removing cBA from C1. Then, as before

C2 = (C2 A) (cBA²)

This process eventually terminates with Cn = Ø and this is
ensured by the fact that A does not match the null string.
Hence we have

cP = ... (cBA³) (cBA²) (cBA) (cB)

from which we obtain (6.38). We conclude that the QUICKSCAN
heuristic limits the solution space of (6.37) to finite linear
solutions. On the other hand under FULLSCAN, (6.37) loops
implying no such restriction on the solution space.
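The solution (6.38) can be built explicitly for a given subject and checked against (6.39). The Python sketch below is a model of the finite linear solution only, not of the QUICKSCAN heuristic itself; the names are hypothetical, and it relies on the assumption of the proposition that A never matches the null string, so that B followed by more than L copies of A cannot match.

```python
def lit(text):
    return lambda s, c: [c + len(text)] if s.startswith(text, c) else []

def apply_seq(seq, p, s):
    return [q for c in seq for q in p(s, c)]

def left_recursive_solution(a, b):
    """P = ... | B A^2 | B A | B, keeping only the powers that can matter."""
    def match(s, c):
        terms = []                    # terms[k] is the sequence (c) B A^k
        seq = b(s, c)
        for _ in range(len(s) + 1):
            terms.append(seq)
            seq = apply_seq(seq, a, s)
        out = []
        for term in reversed(terms):  # longest alternates come first
            out.extend(term)
        return out
    return match

A, B = lit('A'), lit('B')
P = left_recursive_solution(A, B)
subject = 'BAAX'
print(P(subject, 0))                  # [3, 2, 1]

# (6.39): cP = (cPA)(cB) at every cursor of the subject
satisfies = all(P(subject, c) == apply_seq(P(subject, c), A, subject) + B(subject, c)
                for c in range(len(subject) + 1))
print(satisfies)                      # True
```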


Exercise 6.1  Which of the following are true?

a) 'A' = 'A' | 'A'

b) 'A' | 'B' = ANY('AB')

c) ARBNO('A') = NULL | ARBNO('A')

d) BREAK(S) ANY(S) = ARB ANY(S)

e) 'A' | 'B' = 'B' | 'A'

f) ANY('ABC') = NOTANY(DIFF(&ALPHABET,'ABC'))

g) FENCE (P1 | P2) = FENCE P1 | FENCE P2

h) ('AB' | 'DEF') ('G' | 'H') =
   'ABG' | 'ABH' | 'DEFG' | 'DEFH'

i) ARB = ARBNO(LEN(1))

j) (P1 | P2) FENCE = P1 FENCE | P2 FENCE

Exercise 6.2  While pattern alternation is defined as

(c) (P1 | P2) = (c)P1 (c)P2

it is not in general true that

(C) (P1 | P2) = (C)P1 (C)P2

where C is a sequence of cursor positions. Find a
counter-example.

Exercise 6.3  Reduce the following pattern to canonical form

('B' | 'R') ('E' | 'EA') ('D' | 'DS')

Is the pattern monic?


Exercise 6.4  In semigroup terminology a left zero z is
defined as an element such that z e = z for all elements e of
a semigroup. What is a left zero for a) the semigroup of
patterns with the alternation operator, b) the semigroup of
patterns with the concatenation operator, and c) the semigroup
of linear but possibly infinite cursor sequences under
concatenation?


Exercise 6.5  An idempotent element E for an operator * has
the property that

E * E = E

Which of the following are idempotent under concatenation?

a) BREAK(S)     f) NULL
b) SPAN(S)      g) FENCE
c) TAB(N)       h) ABORT
d) POS(N)       i) EN
e) FAIL         j) ARB

Exercise 6.6  Let E1 and E2 be extended sequences and C a
finite linear sequence. Show that any C is left and right
cancellative, where left cancellative is defined by a) and
right cancellative is defined by b).

a) C E1 = C E2  =>  E1 = E2
b) E1 C = E2 C  =>  E1 = E2

Show that arbitrary E are not cancellative by finding an E, E1
and E2 such that

c) E E1 = E E2 but E1 ≠ E2
d) E1 E = E2 E but E1 ≠ E2

Demonstrate that if pattern R is finite, linear, then for any
two patterns P1 and P2

e) R | P1 = R | P2  =>  P1 = P2
f) P1 | R = P2 | R  =>  P1 = P2

Exercise 6.7  What are the first five alternands in the
expression:

ARBNO(ARBNO(LEN(1)))


Exercise 6.8  Show that if M is monic and P is merely any
pattern, then

P FENCE | M

is monic.

Exercise 6.9  Let P = ARB ARB. Let L be the length of the
subject. How many post-cursor positions are there in (0)P?

Exercise 6.10  Show that the pattern matching statement

Subject POS(0) Pattern

is equivalent to the statement

Subject FENCE Pattern


Exercise 6.11  Let

P = ARBNO(LEN(1) ARB)

How many post-cursor positions are there in (0)P where the
size of the subject is L characters?


Exercise 6.12  Prove that if m is linear monic then
m(P1 | P2) = mP1 | mP2.

Exercise 6.13  Which of the following patterns are necessarily
monic?

a) BREAK('ABC')           e) P | ABORT
b) POS(0) | RPOS(0)       f) FENCE P
c) ANY(S) | BREAK(S)      g) P FENCE
d) POS(N) | TAB(N)        h) FENCE | FENCE

Exercise 6.14  Augment the pattern shown in (6.17) to skip
over material in quotes ('...') as well as within comments.
Make sure that characters within unclosed quotes are also
passed over.


Exercise 6.15  Let P = ARBNO('A' ARB 'B'). What is the
sequence of post-cursor positions for

a) ('AB',0)P
b) ('ABAB',0)P
c) (DUPL('AB',K),0)P

Exercise 6.16  Using the technique of Exercise 6.14, write a
pattern which will scan for a PL/I statement failing if none
exists.


Exercise 6.17  Furnish a counter-example to the following

ARBNO(P) = NULL | P | P² | P³ | ...

Exercise 6.18  Using back-up-free scanning, write a pattern
which will print out all SNOBOL4 identifiers in a string of
SNOBOL4 source. Identifiers within quotes should not be
printed. It will be OK to print out the S and F of GOTO's.
For example

ALPHA = 'ABC' B("X") :S(SAM)

should print the strings 'ALPHA', 'B', 'S' and 'SAM'.


Exercise 6.19  Let PL1 and PL2 be the associated linear
patterns of P1 and P2 respectively. Provide a counter-example
to the conjecture that PL1 | PL2 is the associated linear
pattern of P1 | P2.

Exercise 6.20  Let f(P) be an expression involving P composed
of constant patterns, alternation and concatenation. Show that
f(P) can be written as

A1 | B1 P f1(P) | A2 | B2 P f2(P) | A3 ...
... An | Bn P fn(P) | A

where A, A1, A2, ..., An, B1, B2, ..., Bn are patterns not
involving P and f1, f2, ..., fn are functions. From this, show
that if B1, B2, ..., Bn do not match the null string and if no
pattern primitive matches a string of negative length, then

P = f(P)

has at most one value for P.


Exercise 6.21  Which of the following equations for P uniquely
specify a pattern? If P is unique, give its value. Otherwise
indicate a class of values (via X) which will satisfy it.

a) P = RPOS(0) | BREAK(S) P

b) P = ANY(S) | SPAN(S) P

c) P = ANY(S) | BREAK(S) P

d) P = TAB(N) | POS(N) P

e) P = TAB(N) | RTAB(N) P
EA 
| Exercise 6.22 | let P be a pattern not matching the null 


t————— string. Define P- recursively as 
P- = P P~ | NULL 


Show that P- is well defined.  P- is called the negative ARBNO 
of P. 


Let P be given as 


<=> a A a Ce oe eee ee ee a ED 


P = X | YP | 2 


where Y is monic and does not match the null string. Write P 
explicitly in terms of X, Y, Z and the two ARBNO'S. 

While it is not strictly necessary to know how pattern
matching is implemented in order to use SNOBOL4 patterns, it
is necessary to be somewhat aware of the implementation in
order to program efficiently and well. This chapter is based
on the internals of three independent SNOBOL4 implementations,
MAINBOL, SPITBOL, and SITBOL.


The compiler processes all statements in a uniform manner 
without treating the pattern-matching statement any dif- 
ferently (essentially) than any other statement. Every state- 
ment is compiled into a kind of Polish notation which may be 
visualized as a tree. For example the pattern 


     ('A' BREAK('XY') | 'D') (ANY('ABC') | 'HA' | 'TA')


is depicted in Figure 7.1. An empty box denotes concatenation 
and the compiler treats | as associating to the left. 


                          Figure 7.1
                   The compiled form of
     ('A' BREAK('XY') | 'D') (ANY('ABC') | 'HA' | 'TA')


Pattern matching operates by the concerted action of a set of 
built-in monic patterns called primitives. Strings used as 
patterns, and the patterns indicated by BREAK and ANY, fall 
into this category. Abstracting Figure 7.1 to the point of 
representing all primitives by single letters we arrive at the
diagram in Figure 7.2.

                          Figure 7.2
            The abstract tree of the expression:
     ('A' BREAK('XY') | 'D') (ANY('ABC') | 'HA' | 'TA')


This form or structure for the pattern is, however, not the 
most suitable for doing pattern matching. In Figure 7.2 if 
nodes A and B match successfully, node D is then attempted. 
But to obtain D the scanner must go up the tree to the top 
node and back down on the right-hand side to find the primitive
which is to be matched next. Since ancestor information is not
present explicitly in the compiled Polish prefix, this
tree walking would be prohibitively expensive. A similar thing
can be said about the events which occur when a primitive 
fails. The information available from the tree, while com- 
plete, does not seem to be in a form most conducive to rapid 
search. Hence, when the expression represented by the Polish 
tree is evaluated, an entirely new structure is created. An 
example of such a structure is shown in Figure 7.3. A solid 
arrow drawn from a node X to a node Y indicates that if X is 
successful Y will be matched next. Y is called the subsequent 
of X. A dotted arrow from X to Y indicates that, if X fails, 
Y can be tried immediately with the same pre-cursor position. 
Y is then called the alternate of X. 


PATH DIAGRAMS

More formally, a path diagram is an interconnection of nodes.
Each node may have a subsequent (indicated by a solid arrow) or
an alternate (indicated by a dotted arrow) or both. Each node
has an associated primitive which is a monic pattern. An
s-vacancy is a node without a subsequent. An a-vacancy is a
node without an alternate. The root of a path diagram is the
node with no arcs directed into it. (It is easy to show that
construction limits the number of root nodes to one.)

                          Figure 7.3
        The path diagram associated with Figure 7.2.


The path diagram of a pattern consisting only of a primitive p
is simply a node without subsequent and without alternate and
with p as its associated primitive. The concatenation of two
path diagrams D1 D2 is found by drawing a solid arrow from
every s-vacancy of D1 to the root of D2. The alternation of
two path diagrams D1 | D2 is obtained as follows: starting
with the root of D1, search down the chain of alternates until
an a-vacancy is found. Then draw a dotted arrow from this
a-vacancy to the root of D2.
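
The two construction rules can be tried out in a few lines. The
following is only a sketch, written in Python rather than
SNOBOL4 so that it can be run directly. The names Node,
reachable, concat and alternate are invented for the
illustration, and the letter labels A to F follow one reading
of Figure 7.2 (A B | C on the left, D | E | F on the right).
Both operations modify their first argument in place, which is
harmless here only because every node is freshly built.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Node:
    prim: str                       # label of the associated primitive
    subs: Optional["Node"] = None   # subsequent (solid arrow)
    alt: Optional["Node"] = None    # alternate (dotted arrow)


def reachable(root):
    """Return every node reachable from root, each exactly once."""
    seen, todo = [], [root]
    while todo:
        n = todo.pop()
        if n is not None and all(n is not m for m in seen):
            seen.append(n)
            todo += [n.subs, n.alt]
    return seen


def concat(d1, d2):
    """D1 D2: solid arrow from every s-vacancy of D1 to the root of D2."""
    for n in reachable(d1):
        if n.subs is None:
            n.subs = d2
    return d1


def alternate(d1, d2):
    """D1 | D2: follow the alternate chain from the root of D1 to the
    first a-vacancy and draw a dotted arrow from it to the root of D2."""
    n = d1
    while n.alt is not None:
        n = n.alt
    n.alt = d2
    return d1


# (A B | C) (D | E | F)
left = alternate(concat(Node("A"), Node("B")), Node("C"))
right = alternate(alternate(Node("D"), Node("E")), Node("F"))
root = concat(left, right)
```

After this runs, B and C both have D as subsequent, C is the
alternate of A, and D, E, F form a chain of alternates whose
members are the remaining s-vacancies.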


It is interesting to note that the operations of alternation 
and concatenation of path diagrams are (like patterns) as- 
sociative. Hence path diagrams form a semigroup under these 
two operations. 


The pattern node contains four essential fields as indicated
below (one more field is introduced later).

          PROG    program address
          SUBS    subsequent
          ALT     alternate
          ARG     argument

To describe the pattern matching algorithms in SNOBOL4 we
would declare a structure of type NODE as

          DATA('NODE(PROG,SUBS,ALT,ARG)')

Then, to allocate a node for, say, LEN(13), we may execute

          NODE('LENP',,,13)

where the label 'LENP' indicates the location which handles
the LEN primitive. Its encoding would be the machine language
counterpart of the following SNOBOL4 statements.


*  Is the number of characters remaining in the SUBJECT >=
*  ARG(NODE)?  If not, fail!
*
LENP      GE(SIZE(SUBJECT) - CURSOR, ARG(NODE))        :F(F)
*
*  Otherwise compute the post-cursor position and succeed.
*
          CURSOR = CURSOR + ARG(NODE)                  :(S)


Here F is a label in the scanner where all primitives go to 
upon encountering failure and S is the label they go to when 
they encounter success. Note that the primitive bumps the 
CURSOR. 


One may suppose that a routine to concatenate two path
diagrams can be written in SNOBOL4 very easily. Consider the
following attempt.

          DEFINE('CONCAT(P1,P2)')                      :(CONCAT_END)
*
*  If P1 is null, just fail!
*
CONCAT    IDENT(P1,NULL)                               :S(FRETURN)
*
*  Otherwise fill up the s-vacancies of the alternate and
*  subsequent.
*
          CONCAT(ALT(P1),P2)
          CONCAT(SUBS(P1),P2)                          :S(RETURN)
*
*  Failure to CONCAT implies that the subsequent was null.
*  Plug it!
*
          SUBS(P1) = P2                                :(RETURN)
CONCAT_END


The above routine is not valid for several reasons. 1. Path
diagrams, as we will see later, can have loops and this may
ensnare CONCAT in a recursive loop. 2. If the two arguments,
P1 and P2, are identical the result is an abomination. 3. The
algorithm modifies P1, the first pattern. This is only
permissible if it is known that P1 is not to be used for any
other purpose. This guarantee, of course, does not exist.


All three problems can be surmounted by copying the first pat- 
tern. Copying a graph with loops was treated earlier (COPYL, 
Prog. 5.8) and that function can be modified to perform the 
concatenation. See Exercise 7.4. A similar situation prevails 
with respect to alternation. 
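
The copying approach can be sketched as follows. This is not
COPYL of Prog. 5.8, only a minimal Python illustration of the
same idea; the names Node, copy_diagram and concat are
assumptions made for the example. A memo table keyed on node
identity lets the copy terminate on loops, and because only the
copy is modified, P1 survives intact even when P1 and P2 are
the same diagram.

```python
class Node:
    def __init__(self, prim, subs=None, alt=None):
        self.prim, self.subs, self.alt = prim, subs, alt


def copy_diagram(root):
    """Copy every node reachable from root, preserving loops and sharing.
    Returns the new root and the list of all copied nodes."""
    memo = {}                           # id(original) -> its copy

    def cp(n):
        if n is None:
            return None
        if id(n) in memo:               # already copied: reuse, so loops end
            return memo[id(n)]
        new = Node(n.prim)
        memo[id(n)] = new               # record before following the links
        new.subs, new.alt = cp(n.subs), cp(n.alt)
        return new

    return cp(root), list(memo.values())


def concat(p1, p2):
    """Concatenate without touching p1: fill the s-vacancies of a copy."""
    root, copies = copy_diagram(p1)
    for n in copies:
        if n.subs is None:
            n.subs = p2
    return root


# A two-node diagram with a loop: a ..> b (alternate), b -> a (subsequent)
a = Node("X")
b = Node("Y", subs=a)
a.alt = b
r = concat(a, a)                        # identical arguments are now harmless
```

The recursion depth of cp grows with the size of the diagram,
so a production version would want an explicit work list.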


A much more practical method, and one that is used by most im- 
plementors, is to group all the pattern nodes together in one 
contiguous block. This not only facilitates the copy operation 
but increases the speed of sequencing through the nodes of a
pattern. (Exercise 7.6 explores this possibility.)  Logically, 
however, it is correct to think of the pattern as being an 
inter-linked collection of nodes. 


DERIVED PATTERNS

Can a pattern be reconstructed from the path diagram? The
answer is yes. Let p(n) be the primitive associated with node
n. The derived pattern of node n, D(n), is defined in terms of
its associated primitive and the derived patterns of its
subsequent node s and its alternate a as follows:


D(n) = p(n) D(s) | D(a) if a and s exist 
= p(n) D(s) if only s exists 
= p(n) | D(a) if only a exists 
= p(n) if neither a nor s exists 


The derived pattern of a path diagram is defined as the 
derived pattern of its root. 
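
The four cases translate directly into a recursive function. The
sketch below is in Python and assumes a loop-free diagram (on a
diagram with loops it would recurse forever). Node and derived
are names invented here, and the sample diagram is one reading
of Figure 7.3. Parentheses are added around D(s) exactly when s
has an alternate, since only then is D(s) an alternation.

```python
class Node:
    def __init__(self, prim, subs=None, alt=None):
        self.prim, self.subs, self.alt = prim, subs, alt


def derived(n):
    """Return D(n) as text: p(n) [D(s)] [| D(a)]."""
    text = n.prim
    if n.subs is not None:
        d = derived(n.subs)
        # D(s) is an alternation exactly when s itself has an alternate
        text += " " + (f"({d})" if n.subs.alt is not None else d)
    if n.alt is not None:
        text += " | " + derived(n.alt)
    return text


# A -> B -> D, A ..> C, C -> D, D ..> E ..> F
F = Node("F")
E = Node("E", alt=F)
D = Node("D", alt=E)
C = Node("C", subs=D)
B = Node("B", subs=D)
A = Node("A", subs=B, alt=C)

print(derived(A))    # A B (D | E | F) | C (D | E | F)
```

The printed form is (A B | C) (D | E | F) with the concatenation
distributed over the left-hand alternation.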


When the scanner is defined, it will be seen that it
implements the derived pattern. Also, it can be shown [Gimpel,
1971] that any pattern will equal the derived pattern of its
path diagram. Together these two observations constitute a
proof of the pattern matching algorithm and provide a
theoretical basis for the extensions which follow.


Program 7.1  SCAN

The algorithm used internally to do pattern matching is
illustrated by the function SCAN. SCAN has two arguments, the
LENGTH of the subject and a pattern identified by its root
node NODE. The subject itself is held by a global variable
SUBJECT and the current cursor value is held in a global
variable CURSOR. There are good reasons for the selection of
which quantities are to be passed to SCAN and which quantities
are global. These reasons will be evident when Unevaluated
Expressions are discussed.


The initial value of CURSOR is set by a driver program called 
MATCH (Exercise 7.8). In unanchored mode, if SCAN fails, MATCH 
increments this pre-cursor by 1 and calls SCAN again. The al- 
gorithm requires a stack and the familiar operations of PUSH 
and POP. The driver program initializes things by pushing a 
null alternate and a pre-cursor value. 


*  Basic SCAN function.  The pattern identified by its root
*  node NODE is matched against the SUBJECT at a pre-cursor
*  position given by the global variable CURSOR.  CURSOR is
*  updated on success.  The stack is another global quantity
*  which SCAN modifies as a side-effect.  If it fails, the
*  start-up alternate-cursor pair are popped.  On success, a
*  sequence of alternates may remain on the stack.
*
          DEFINE('SCAN(LENGTH,NODE)')
          DATA('NODE(PROG,ALT,SUBS,ARG)')              :(SCAN_END)
*
*  Entry point and top of loop: If an alternate to the current
*  node exists, push the alternate and the current cursor.
*
SCAN      (DIFFER(ALT(NODE)) PUSH(ALT(NODE)) PUSH(CURSOR))
*
*  Go to the program label associated with the current node.
*  Return arrives at either S or F.
*
                                                       :($PROG(NODE))
*
*  Here on success.  Set NODE to the subsequent.  If there is
*  none, we are done; report success.  Otherwise go back to
*  SCAN.
*
S         NODE = SUBS(NODE)
          IDENT(NODE,NULL)                             :S(RETURN)F(SCAN)
*
*  Here on failure.  Pop the stack for an alternate.  If null,
*  fail.  Otherwise attempt to SCAN at this node.
*
F         CURSOR = POP() ; NODE = POP()
          IDENT(NODE)                                  :S(FRETURN)F(SCAN)
SCAN_END

Names referenced by SCAN:

          Name    Type        Where defined
          PUSH    Function    Program 5.5
          POP     Function    Program 5.6
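
For readers who want to step through the algorithm, here is a
Python transliteration of SCAN together with a simple unanchored
driver. It is a sketch under stated assumptions, not the code of
any implementation: the class Scanner, the primitive functions
lit, any_of and brk, and the driver method match are all names
chosen for this example. The cursor and the stack are kept on
the scanner object in place of the global variables, and a
primitive returns True or False instead of branching to S or F.

```python
class Node:
    def __init__(self, prog, arg, subs=None, alt=None):
        self.prog, self.arg, self.subs, self.alt = prog, arg, subs, alt


class Scanner:
    def __init__(self, subject):
        self.subject, self.cursor, self.stack = subject, 0, []

    def scan(self, node):
        while True:
            if node.alt is not None:            # SCAN: save the alternate
                self.stack.append(node.alt)
                self.stack.append(self.cursor)
            if node.prog(self, node.arg):       # S: move to the subsequent
                node = node.subs
                if node is None:
                    return True
            else:                               # F: pop an alternate
                self.cursor = self.stack.pop()
                node = self.stack.pop()
                if node is None:
                    return False

    def match(self, root):
        """Unanchored driver: try each pre-cursor position in turn."""
        for start in range(len(self.subject) + 1):
            self.cursor = start
            self.stack = [None, start]          # null alternate, pre-cursor
            if self.scan(root):
                return start, self.cursor
        return None


def lit(sc, s):                                 # a string used as a pattern
    if sc.subject.startswith(s, sc.cursor):
        sc.cursor += len(s)
        return True
    return False


def any_of(sc, chars):                          # ANY(chars)
    if sc.cursor < len(sc.subject) and sc.subject[sc.cursor] in chars:
        sc.cursor += 1
        return True
    return False


def brk(sc, chars):                             # BREAK(chars)
    for i in range(sc.cursor, len(sc.subject)):
        if sc.subject[i] in chars:
            sc.cursor = i
            return True
    return False


# ('A' BREAK('XY') | 'D') (ANY('ABC') | 'HA' | 'TA')
ta = Node(lit, "TA")
ha = Node(lit, "HA", alt=ta)
any_abc = Node(any_of, "ABC", alt=ha)
d = Node(lit, "D", subs=any_abc)
brk_xy = Node(brk, "XY", subs=any_abc)
root = Node(lit, "A", subs=brk_xy, alt=d)

print(Scanner("XDHAY").match(root))             # (1, 4)
```

On the subject 'XDHAY' the scan at pre-cursor 0 fails, and the
scan at pre-cursor 1 matches 'D' and then 'HA' after ANY has
failed and its alternate has been popped.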


HEURISTICS

Each implementation contains a certain number of so-called
pattern matching heuristics which are intended to increase the
speed of matching while having minimal effects upon the
success or failure of the match. Generally they fall into two
categories, those which speed up matching without affecting
the overall outcome of the match (termed unobtrusive) and
those which may have some effect on the outcome of the match
(obtrusive heuristics). The programmer may turn off all
heuristics by setting &FULLSCAN = 1 in which case he is said
to be matching in FULLSCAN mode. Otherwise he is operating in
QUICKSCAN mode. At this writing he cannot selectively turn off
individual heuristics or, for example, choose the unobtrusive
but suppress the obtrusive heuristics. There are four
heuristics: futility, length-checking, start-up and recursive
reduction. None of these heuristics is intrinsically obtrusive
but under certain assumptions they may indeed become
obtrusive. There is a fifth heuristic which is a protection
heuristic as opposed to a speed heuristic. Its purpose is to
catch programming errors. The pattern ARBNO(NULL) will loop
forever in FULLSCAN mode. In QUICKSCAN mode, the scanner
checks the number of characters matched by the argument to
ARBNO and terminates if 0 characters were matched. Some
implementations have not included this heuristic and its
inclusion in a language which permits arbitrary statement
looping seems questionable. We will not consider it further.


Futility - Under FULLSCAN the driver program successively
calls SCAN for all cursor values with the given subject in the
order of increasing cursor position. But such a procedure can
be woefully time-consuming as in the following common example.

          S BREAK(';') . K

which causes string S to be scanned for a semicolon and, if
found, assigns the initial substring to K. Under FULLSCAN, a
failure at CURSOR = 0 will cause a repeat at CURSOR = 1 which
will necessarily also result in failure, etc. A total of L + 1
scans will be made where L is the length of the string. The
wary user can anchor the scan either by prefixing a POS(0) to
the pattern or by using &ANCHOR mode. However under QUICKSCAN
mode, the futility heuristic will cause an abrupt halt of
scanning after the first failure.


A pattern is said to be futile for a certain cursor c if it
fails at this and all advances of the cursor position. That
is, if

          (c')P = Ø for all c' >= c

then P is futile for cursor c. If BREAK(S) fails at cursor c
it is also futile at cursor c. Hence, in the above example,
additional scanning at advanced cursor positions is not
needed. But it is not always possible to make a simple test
to determine the futility of a pattern. If the pattern is the
string 'XXX' and the subject is 'ABCDE' the pattern is futile
for any cursor position but normally this is not discovered
until after at least 3 attempts are made to match 'XXX'.
Hence, string patterns report futility only when there is 
insufficient length in the subject string. This is termed 
length failure. For convenience, whenever a primitive detects 
futility, it is said to experience length failure, or simply, 
to length fail. Thus, when BREAK fails, it reports length 
failure even though, strictly speaking, the futility is not 
due to an insufficient number of characters. 


If a pattern primitive detects that it is futile, it branches 
to a length-failure exit (LF). Otherwise it branches to match- 
failure (MF). Both of these are in lieu of the single fail 
location (F) in the function SCAN. Most pattern primitives 
can transmit futility detected by a subsequent. This means 
that if p2 is the subsequent of p1, and if p2 reports length
failure, p1 can also report length failure. More formally,
the primitive p is called a transmitter if, whenever any pat- 
tern P is futile at cursor c, and if (c')p = c, then (p P) is 
futile at c'. 
A necessary and sufficient condition that a monic pattern p be
a transmitter is that p be monotonic in the sense that any
increase in pre-cursor position brings about a non-decrease in 
post-cursor position. Virtually all primitives in SNOBOL4 are 
monotonic. Hence the scanner makes the assumption that all 
primitives are transmitters. Under the transmitter assumption, 
if all local failures are length-failures then the overall 
pattern is futile. 


For example, let

          Subject: 'ABC.....................D'
          Pattern: 'ABC' BREAK('D') 'DE'

Then the 'DE' when matched against the 'D' will length-fail
indicating futility. BREAK('D') is a transmitter since its
post-cursor position cannot possibly back up if its pre-cursor
advances. Hence (BREAK('D') 'DE') is futile. By a similar
line of reasoning, 'ABC' is also a transmitter and hence the
entire pattern is futile. The initial cursor position,
therefore, need not be advanced beyond 0.


The futility heuristic is implemented by a global flag which 
is set on at the start of a scan and is turned off at any 
match-fail or if a non-transmitter succeeds. The flag is 
called the futility flag. If the futility flag is on when the 
overall pattern fails, it is useless to go on. The overall
pattern is futile. 
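
The effect of the flag on the number of scans can be seen in a
small model. The sketch below is Python and deliberately
simplified: a pattern is just a concatenation of primitives (no
alternates), every primitive is treated as a transmitter, and
each primitive returns 'S', 'MF' or 'LF' instead of branching to
a label. The names lit, brk, scan and match are assumptions of
this example. With no alternates, the first failure ends the
scan, so the flag is still on exactly when that failure was a
length failure.

```python
def lit(s):
    """A string pattern: length-fails only when too few characters remain."""
    def prim(subject, cursor):
        if len(subject) - cursor < len(s):
            return "LF", cursor
        if subject.startswith(s, cursor):
            return "S", cursor + len(s)
        return "MF", cursor
    return prim


def brk(chars):
    """BREAK(chars): per the text, any failure is reported as length failure."""
    def prim(subject, cursor):
        for i in range(cursor, len(subject)):
            if subject[i] in chars:
                return "S", i
        return "LF", cursor
    return prim


def scan(prims, subject, cursor):
    """Return (post-cursor or None, state of the futility flag)."""
    for p in prims:
        result, cursor = p(subject, cursor)
        if result == "MF":
            return None, False      # a match-fail turns the flag off
        if result == "LF":
            return None, True       # flag still on: the pattern is futile
    return cursor, True


def match(prims, subject, fullscan=False):
    """Unanchored driver; returns ((start, post) or None, number of scans)."""
    scans = 0
    for start in range(len(subject) + 1):
        scans += 1
        post, futile = scan(prims, subject, start)
        if post is not None:
            return (start, post), scans
        if futile and not fullscan:
            break                   # QUICKSCAN: useless to go on
    return None, scans


subject = "ABC" + "." * 10 + "D"
pattern = [lit("ABC"), brk("D"), lit("DE")]
print(match(pattern, subject))                  # (None, 1)
print(match(pattern, subject, fullscan=True))   # (None, 15)
```

The same driver gives L + 1 scans for BREAK(';') on a subject
with no semicolon under FULLSCAN and a single scan under
QUICKSCAN.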


The futility heuristic is unobtrusive for patterns which are
nonvarying. For varying patterns the heuristic becomes
obtrusive. For example, the pattern matching statement

          'ABXB' ANY('AB') $ C BREAK(*C)

will first assign 'A' to C and the pattern BREAK(*C) will
fail. BREAK signals length failure and the scanner erroneously
concludes that the entire pattern is futile. Should the
pattern be matched with a pre-cursor of 1, C would be assigned
the character 'B' and the subsequent BREAK would succeed.
Hence the pattern was not futile. The difficulty stems from 
the fact that BREAK lied. If its argument is indeed an 
unevaluated expression, it should not signal length failure 
unless there are no characters left in the string. 


ARB is a pattern which can use the futility heuristic in two 
ways to hasten scanning. If the subsequent to ARB is futile 
at any given cursor then ARB need not extend. Moreover, (ARB 
P) where P is the subsequent will be futile. For example: 


          Subject: 'AXXXBXXX'
          Pattern: 'A' ARB 'B' ARB 'C'

In the above, the 'A' will be matched against the first
character. ARB will match 0, then 1, 2, and 3 characters until
'B' succeeds. The second ARB will match 0, 1, 2 characters
until 'C' is futile. Hence, ARB 'C' is detected as being
futile at position 5 and ARB 'B' ARB 'C' is detected as futile
at position 1. The scanner can halt immediately. The futility
heuristic for ARB is implemented by pushing the original state
of the futility flag onto the stack. When the subsequent to
ARB signals futility, ARB restores the state of the futility
flag and takes the length-fail exit. If ARB receives no
indication of futility for all post-cursor positions up to and
including L, the length of the subject, then ARB should
indicate match failure.
beginning with POS(n) to be applied immediately at CURSOR =
n. The effect is an anchored mode except that the anchoring 
is done at a position other than CURSOR = 0. Both SPITBOL and 
SITBOL use this heuristic and SITBOL also uses a similar 
heuristic for patterns beginning with RPOS(n). Another start- 
up heuristic exclusive to SITBOL is so-called contextual 
anchoring. Many patterns will only match substrings beginning 
with certain letters. For example SPAN('ABC') can only match 
a substring starting with one of these 3 letters. The pattern 
'CAT' | 'DOG' will match only a string beginning with 'C' or
'D'. Rather than call SCAN at each cursor position, it is
faster if the driver program makes a rapid pre-scan (at BREAK 
speeds) to a point where a pattern would find a letter that it 
could possibly begin matching. Failure at the first contextual 
anchor point implies a repeated attempt to scan for the next 
contextual anchor point. The alternation of two patterns 
which are both contextually anchored is also contextually 
anchored by the union of the anchoring sets. The concatenation 
of two patterns is always anchored by the anchoring, if any, 
of the left-most pattern. The start-up heuristics in all their 
variations are unobtrusive. 
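
The two composition rules for anchoring sets are easy to state
in code. The following Python sketch is illustrative only and
does not describe SITBOL's internals: patterns are written as
nested tuples, and the tuple tags ('lit', 'span', 'any', 'alt',
'cat') and the function names are assumptions of this example.
A result of None means that the pattern is not contextually
anchored.

```python
def anchor_set(p):
    """Set of characters a match must begin with, or None if unknown."""
    kind = p[0]
    if kind == "lit":                     # a non-null string
        return {p[1][0]} if p[1] else None
    if kind in ("span", "any"):           # SPAN(chars), ANY(chars)
        return set(p[1])
    if kind == "alt":                     # union, if both sides are anchored
        a, b = anchor_set(p[1]), anchor_set(p[2])
        return a | b if a is not None and b is not None else None
    if kind == "cat":                     # anchoring of the left-most pattern
        return anchor_set(p[1])
    return None                           # e.g. ARB: no anchoring


def anchor_points(p, subject):
    """Cursor positions at which the driver would bother to call SCAN."""
    chars = anchor_set(p)
    if chars is None:
        return list(range(len(subject) + 1))
    return [i for i, ch in enumerate(subject) if ch in chars]


pets = ("alt", ("lit", "CAT"), ("lit", "DOG"))
print(sorted(anchor_set(pets)))                 # ['C', 'D']
print(anchor_points(pets, "HOT DOG CART"))      # [4, 8]
```

For 'HOT DOG CART' only two of the thirteen possible pre-cursor
positions are ever handed to the scanner.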


Length Checking - This check operates as follows. In the
course of building a pattern the pattern builder deduces a
minimum length for each node. During a match, if the number
of characters remaining in the subject is below this number,
then the node can immediately signal length-failure. The
difficulty with this technique is that it takes time to make
this test and it effectively duplicates another test made
concurrently, the futility check. For example suppose the
pattern is the string 'ABC'. Suppose the subject is '1234567'.
The minimum length required by the pattern is 3. The length
check is made 6 times. The first 5 times indicate that there
is sufficient room in the subject. The last time a check is
made, the length-fail exit is taken. However if the primitive were
given control it would also have length failed so that the 
test is redundant. Moreover the primitive could have deduced 
that after the 5th time it was futile. If it signals length 
failure when there are 3 characters remaining (which it should 
ideally do) then the minimum length check never gets a chance 
to signal length failure. All of its activity went to increase 
the time of scanning. The length test came historically before 
the futility heuristic and its retention is probably for that 
reason. 
Length-checking would not be obtrusive if it were not for the
so-called one-character assumption. Any unevaluated expression
is assumed to match at least one character. For example

          (LEN(1) $ X) (LEN(1) $ Y) *LGT(X,Y)

will look for two characters out of order in a string.
Unfortunately, if the two characters are the last two of the
string, it will not find them because the predicate is assumed
(erroneously) to consume one character. This is perhaps the
most obtrusive heuristic of them all since the use of
predicates within a pattern is extremely common and would be
even more so if it were not for this heuristic. The
length-test heuristic appears only in SPITBOL and MAINBOL.
SITBOL and FASTBOL avoid this test for the reasons indicated.


Recursive Reduction - This refers to the scheme whereby
SNOBOL4 is able to break left-recursive loops as in the
pattern:

          P = *P 'A' | 'B'

We will defer a discussion of this heuristic until after the
implementation of recursive patterns is considered.


COMPOUNDS

Some built-in patterns are not implemented by a single node,
either because they are not monic or because it is more
efficient to implement them as several nodes rather than one
node. These patterns are predefined by a path diagram of two
or more nodes and are called compounds. Examples of compounds
are the patterns with implicit alternatives such as ARB, BAL,
and ARBNO(p).


ARB 


A pattern which does nothing but succeed is called nil. The
node for nil is shown below

          PROG    S
          SUBS    subsequent
          ALT     alternate

where S refers to that label in the scanner to which control
is passed in the event of a successful match. Since the
primitive is effectively short-circuited, this is the fastest
possible successful pattern. The null string may be coded as
the nil node (it is not normally). There is no argument for
nil.
ARB can be thought of as being recursively defined as 

ARB = NULL | (LEN(1) ARB) 


and this leads to the compound shown in Figure 7.4. Here, 'a' 
denotes the alternate to ARB and 's' denotes its subsequent. 


                          Figure 7.4
                     A compound for ARB.


Figure 7.4, though conceptually simple, is not the most ef- 
ficient form of ARB. The futility heuristic as applied to ARB 
needs to be implemented (see Futility) and more scanner ac- 
tivity can be incorporated within the ARB compound with a 
consequent gain in efficiency. The more efficient ARB realiza- 
tion is shown in Figure 7.5. 


                          Figure 7.5
                 An improved version of ARB.
The associated primitives ARB1 and ARB2 are defined as:

*  Save the state of the futility flag and set it in order to
*  detect it in the subsequent.
*
ARB1      PUSH(FUTILITY)
          FUTILITY = 1                                 :(S)
*
*  If the subsequent is futile, restore the old futility flag
*  and length fail provided we're in QUICKSCAN mode.
*
ARB2      FUTILITY = EQ(FUTILITY,1) EQ(&FULLSCAN,0)
+         POP()                                        :S(LF)
*
*  Else bump the cursor and compare with LENGTH of subject.
*  If beyond the end of the subject, pop the old futility
*  flag and match-fail.
*
          CURSOR = CURSOR + 1
          (GT(CURSOR,LENGTH) POP())                    :S(MF)
*
*  Otherwise, play scanner by pushing ourself and the current
*  cursor onto the stack and succeed.
*
          PUSH(NODE) ; PUSH(CURSOR)                    :(S)


Note the action of ARB if its subsequent is futile. ARB itself 
is regarded as being futile and it indicates this condition by 
restoring the state of the futility flag. Note that this al- 
gorithm is obtrusive if the subsequent is varying. For exam- 
ple, the pattern matching statement 


          'ABCB' LEN(1) $ X ARB 'C' *X

will succeed in FULLSCAN mode with X matching 'B' but will
fail in QUICKSCAN mode. In QUICKSCAN mode the 'A' is assigned
to X initially; when 'C' match-fails, control arrives at ARB2
which increments the cursor. Ultimately, 'C' length-fails.
When control arrives at ARB2, the FUTILITY flag is still on
resulting in a length failure and termination of the match.
It is important that ARB length-fail if its subsequent is
futile. Consider the pattern match

          S ARB . T 'CAT'

which scans S for 'CAT' assigning the prefix to T. If no 'CAT'
exists in S, the match will require on the order of L^2 matches
under FULLSCAN and on the order of L matches under QUICKSCAN
where L is the length of the string. Here the desire to have 
unobtrusive heuristics seems to collide with the need for an 
intelligent scanner. No completely satisfactory scheme has 
yet been worked out. 
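
The difference between the two growth rates can be checked by
counting. The Python sketch below is a coarse model, not a
rendering of ARB1 and ARB2: it assumes that ARB stops extending,
and that the driver stops advancing, as soon as the string
following ARB length-fails under QUICKSCAN. The function name
count_attempts and its return convention are inventions of this
example. It counts how many times the string primitive is
attempted for an unanchored ARB followed by a literal.

```python
def count_attempts(subject, word, quickscan):
    """Model of the unanchored match  subject ARB word.
    Returns ((start, post) or None, attempts of the string primitive)."""
    attempts = 0
    length = len(subject)
    for start in range(length + 1):           # driver advances the pre-cursor
        futile = False
        for cur in range(start, length + 1):  # ARB matches 0, 1, 2, ... chars
            attempts += 1
            if length - cur < len(word):      # the string length-fails here
                if quickscan:
                    futile = True             # ARB length-fails in turn
                    break
                continue                      # FULLSCAN keeps extending
            if subject.startswith(word, cur):
                return (start, cur + len(word)), attempts
        if futile:
            break                             # whole pattern futile: halt
    return None, attempts


s = "X" * 100
print(count_attempts(s, "CAT", quickscan=True))    # (None, 99)
print(count_attempts(s, "CAT", quickscan=False))   # (None, 5151)
```

For L = 100 the model makes L - 1 attempts under QUICKSCAN and
(L + 1)(L + 2)/2 attempts under FULLSCAN.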


BAL 


Define a balanced string as any string which either 1) does
not contain a parenthesis, or 2) is a balanced string bounded
by parentheses or 3) consists of any sequence of balanced
strings. The BAL pattern of SNOBOL4 matches all nonnull
balanced strings beginning at a given pre-cursor position. The
sequence of post-cursor positions is from smaller to larger.
It is relatively straightforward to write a monic pattern to
match the earliest (i.e. shortest) balanced string starting at
a pre-cursor position. A parenthesis count is maintained. If
a left paren is encountered the count is incremented by 1. If
a right paren is encountered the count is diminished by 1. If
the count ever goes negative the monic fails. If the count
reaches 0 (after the first character), a successful match is
reported. This monic pattern is available as a primitive
(called GBAL) within the implementation and is used to
implement BAL. As an example the table below shows the
behavior of GBAL on the subject 'A(C()D)'.

          Pre-cursor     0  1  2  3  4  5  6  7
          Post-cursor    1  7  3  5  -  6  -  -

where a dash (-) indicates failure. BAL can be written in
terms of GBAL as

          BAL = GBAL ARBNO(GBAL)

and the corresponding BAL compound is shown in Figure 7.6.


                          Figure 7.6
                      The BAL compound.


The GBAL primitive, as the above example illustrated, is not 
monotonic and hence does not transmit length failure. GBAL, 
therefore, turns the futility flag off if it succeeds. If the 
subsequent s is futile, further alternatives need not be 
taken. 
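
The parenthesis-counting rule for GBAL and the relation
BAL = GBAL ARBNO(GBAL) can both be checked against the table
for 'A(C()D)'. The following is a Python sketch; the function
names gbal and bal are chosen here, failure is represented by
None, and bal is written as a generator that yields the
successive post-cursor positions from smaller to larger.

```python
def gbal(subject, cursor):
    """Post-cursor of the shortest nonnull balanced string, or None."""
    count = 0
    for i in range(cursor, len(subject)):
        if subject[i] == "(":
            count += 1
        elif subject[i] == ")":
            count -= 1
        if count < 0:
            return None             # unmatched right paren: fail
        if count == 0:
            return i + 1            # balanced after at least one character
    return None                     # ran off the end of the subject


def bal(subject, cursor):
    """Post-cursor positions of BAL = GBAL ARBNO(GBAL), in order."""
    post = gbal(subject, cursor)
    while post is not None:
        yield post
        post = gbal(subject, post)


s = "A(C()D)"
print([gbal(s, c) for c in range(8)])   # [1, 7, 3, 5, None, 6, None, None]
print(list(bal(s, 2)))                  # [3, 5, 6]
```

The first line of output reproduces the table. Pre-cursors 1
and 2 give post-cursors 7 and 3, which is the loss of
monotonicity that prevents GBAL from transmitting length
failure.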
ARBNO(p)


The path diagram for ARBNO(p) is obtained from the path 
diagram for p in the by-now familiar method suggested by the 
examples of ARB and BAL. Figure 7.7 indicates how we can form 
this path diagram from the path diagram for the pattern p. 


                          Figure 7.7
                A path diagram for ARBNO(p).


Variable Association 


An expression of the form p . v where p is a pattern and v is
a variable (or an unevaluated expression which will evaluate 
to a variable) is called a conditional variable association. 
The variable v is associated with the indicated pattern and 
will be assigned the substring matched by p on the condition 
that the overall pattern is successful. An expression p $ v 
is called an immediate association. Any substring matched by 
p is assigned immediately to v. The path diagram for p . v 
can be given in terms of the path diagram for p and is shown 
in Figure 7.8. A similar diagram could be drawn for p $ v. 


The stack which receives alternates and cursor values during 
the course of the match is called the pattern matching history 
stack or PM stack for short. To describe the operation of
conditional variable association, we postulate the existence 
of two more stacks which we will refer to as stack Alpha and 
stack Beta. When VA1 (Variable Association 1) receives con- 
trol, it pushes the current cursor (pre-cursor position) onto 
stack Alpha. If p should fail, VAB1 (Variable Association on 
Backup 1) will receive control and it will pop Alpha. It will 
then fail forcing control to go to alternate a. Should p suc- 
ceed, control arrives at VA2. The current cursor and the pre- 
cursor pushed by VA1 are sufficient to define the string to be 
assigned to variable v. The two cursor positions and v are
pushed onto stack Beta and the cursor on stack Alpha is
popped. Should the subsequent fail, VAB2 gets control and
undoes what VA2 did. That is, the three values on Beta are
popped and Alpha is pushed with the original pre-cursor
position. VAB2 then fails forcing alternates on the PM stack
to be invoked.

                          Figure 7.8
                   A compound for p . v


If the overall match is successful, Beta is scanned on a FIFO
basis (left-to-right) and assignments are made in turn. If
the variable is an unevaluated expression, the evaluation is
made at this time, by a possibly recursive call.


Stack Beta is normally called the name-list stack. It operates 
in synchronism with the PM stack and, hence, it would have 
been possible to use this latter stack to push the two cursor 
values and the variable. It would not normally be difficult 
or time-consuming to extract these values from the PM stack at 
termination of matching. But differences in the way the gar- 
bage collector treats each stack may make a separate name-list 
stack desirable. Here, implementation considerations at the 
bit level often determine whether 1 or 2 stacks are used for 
this purpose. Stack Alpha, on the other hand, grows dif- 
ferently than the PM stack. The overall system stack which is 
employed for expression evaluation and recursive calls is 
used. The system stack, as we will see, may be active during 
pattern matching (to implement unevaluated expressions) but 
its net growth from the beginning of processing of one node to 
the beginning of processing of its subsequent is always 0 (un- 
less used as the Alpha stack of substring assignment). 


Immediate variable association is similar but simpler than 
conditional association and will be left as an exercise. 
UNEVALUATED EXPRESSIONS

Unevaluated expressions may be used as patterns and, if
so, are evaluated during a pattern match. The
result of such an evaluation may be any pattern,
even one containing unevaluated expressions. The
difficulty with handling unevaluated expressions,
which can result in arbitrary path diagrams, is to effectively 
combine the new path diagram with the old. In principle, this 
path diagram could be fused into the overall pattern by means 
of the pattern building process discussed earlier. However, 
since this pattern is evaluated whenever the scanner is moving 
forward through the pattern, this pattern building process may 
take place many times during a single pattern match. Worse, 
the pattern would have to be detached before the next new patterns
were joined and this would promise more difficulties.
Hence, rebuilding the pattern is not a satisfactory solution. 


Let STAR be the program label associated with that part of the
system which is to process unevaluated expressions. The argu- 
ment in the node associated with STAR is the unevaluated 
expression which we assume that STAR can readily evaluate. We 
note that the evaluation of the argument can evoke a 
programmer-defined function which can, by virtue of it perfor- 
ming pattern matching, reenter the scanner. This requires 
that, before the unevaluated expression is evaluated, a host 
of values such as the cursor position, the subject, the cur- 
rent value of the push-down list, and the NODE position be 
placed in the system stack to be restored after the argument 
is evaluated. In our pseudo-implementation of pattern matching 
all this is taken care of automatically by declaring the appropriate
variables to be either parameters or temporaries of
the function MATCH. 


Assuming that this is done, the result of this evaluation is a 
pattern P. What STAR must do is apply this pattern to the
subject at the given pre-cursor position. This can be done by 
a call (recursive) to the function SCAN if we first provide 
isolation between this call and previous uses of the stack. 
This takes the form 


STAR      P = EVAL(ARG(NODE))
          PUSH(NULL) ; PUSH(CURSOR)
          SCAN(LENGTH,P)                             :F(MF)S(S)


It is a minor detail but if the result of evaluation is an 
unevaluated expression it is again EVALed. Assuming that a 
pattern P emerges from the evaluation procedure it is applied 
to the subject at the current cursor position by means of the 
call to SCAN. If P fails, the insulating null-cursor will have 
been popped and SCAN will fail. In this case STAR simply 
relays the failure. If P succeeds, SCAN will succeed and STAR 
reports success. If the subsequent to STAR is ultimately suc- 
cessful, nothing more need be said. If unsuccessful, the list 
of alternates laid down on the stack by P must be invoked. But 
they cannot be invoked straight-away as any gyrations of their
own accord would cause success or failure of the evaluated
pattern P to be interpreted as success or failure of the pat- 
tern as a whole. Hence a kind of second insulation is set up 
to receive control should s fail. This comes in the form of 
the primitive RESTAR shown in Figure 7.9. 


[Path diagram omitted: STAR lies on the forward path to the
subsequent s; RESTAR receives control when s fails and leads to
the alternate a]


STAR      P = ARG(NODE)
STAR_1    P = EVAL(P)                                :F(MF)
          IDENT(DATATYPE(P),'EXPRESSION')            :S(STAR_1)
          PUSH(NULL) ; PUSH(CURSOR)
STAR_2    REDUCTION = 0
          REDUCTION = EQ(&FULLSCAN,0) RESID(NODE)
          GT(REDUCTION,LENGTH)                       :S(LF)
          SCAN(LENGTH - REDUCTION,P)                 :F(MF)S(S)
RESTAR    CURSOR = POP()
          P = POP()
          IDENT(P,NULL)                              :S(MF)F(STAR_2)


Figure 7.9


A compound to implement Unevaluated Expressions.


When RESTAR receives control it pops the stack. If the alter- 
nate is null, this is the insulating null-cursor pair and 
RESTAR simply fails. Otherwise it merges with the STAR primi- 
tive which calls SCAN with the popped alternate as argument. 


The previously cited Recursive Reduction heuristic is shown in 
Figure 7.9. A fifth field of a pattern node is called the 
residual. This equals the minimum number of characters  re- 
quired by the node's subsequent to match. The field name used 
is RESID so that the data statement for a pattern node should 
really read 


          DATA('NODE(PROG,SUBS,ALT,ARG,RESID)')


Residuals are computed by assigning a minimum length string to 
each pattern. For example, the minimum lengths of BREAK(S), 
TAB(N), POS(N) and FENCE are each 0. The minimum length of 
SPAN(S) and BAL are each 1. The minimum length of a string is 
the size of the string, etc. The minimum length of the 
concatenation of two patterns is the sum of their minimum 
lengths. The minimum length of the alternation of two patterns 
is the minimum of their minimum lengths. When two patterns 
are concatenated, the residual of each node is incremented by 
the minimum length of the second pattern. When two patterns 
are alternated, all residuals remain unchanged. The minimum 
length of a pattern can either be partially recomputed for 
each concatenation from the residual of the root node and the 
minimum length of the root or may be stored in a pattern 
header where global information about the pattern is kept or 
may be retained separately for each node in another field 
(MINLEN) of the pattern node. 


As an example of the recursive reduction heuristic


          P = *P 'A' | 'B'


will not loop. Since the residual of *P is 1 (the minimum
length of 'A'), SCAN is called with ever-decreasing LENGTHs.
On the other hand 


P = *P BREAK('A') BREAK('B') | 'B' 


will loop because the residual of *P is 0. Note that 
BREAK('A') BREAK('B') matches at least one character but the
simple-minded minimum-character algorithm fails to detect 
this. 


It is not uncommon to encounter the BNF-like expression


          P = *P *Q | 'A'


This pattern would loop if it were not for the drastic assump- 
tion that unevaluated expressions require a single character 
to match. This is the so-called one-character assumption. 
Given this assumption, the residual of *P is 1 and so the num- 
ber of recursive plunges is limited by the length of the 
string. Note that the one-character assumption has nothing to 
do with the number of characters required by *P but only *Q. 
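
The minimum-length rules and the residual of a leading unevaluated expression can be checked mechanically. The following Python sketch is an illustration only: the encoding of patterns as nested tuples and the names minlen and residual_of_first are assumptions made for this example, and only the primitives whose minimum lengths are listed above are included.

```python
# Minimum lengths of a few primitives, as listed in the text.
PRIMITIVE_MINLEN = {"BREAK": 0, "TAB": 0, "POS": 0, "FENCE": 0,
                    "SPAN": 1, "BAL": 1}


def minlen(pat):
    """Return the minimum number of characters pat must match."""
    kind = pat[0]
    if kind == "str":            # ("str", "ABC")
        return len(pat[1])
    if kind == "prim":           # ("prim", "BREAK")
        return PRIMITIVE_MINLEN[pat[1]]
    if kind == "star":           # ("star", "Q"): unevaluated expression
        return 1                 # the one-character assumption
    if kind == "cat":            # ("cat", left, right)
        return minlen(pat[1]) + minlen(pat[2])
    if kind == "alt":            # ("alt", left, right)
        return min(minlen(pat[1]), minlen(pat[2]))
    raise ValueError("unknown pattern kind: %r" % (kind,))


def residual_of_first(cat_pat):
    """Residual of the leading element of a concatenation: the minimum
    length of everything that must follow it."""
    if cat_pat[0] != "cat":
        raise ValueError("expected a concatenation")
    return minlen(cat_pat[2])


STAR_P = ("star", "P")
BREAK = ("prim", "BREAK")

# *P 'A'                      : residual of *P is 1, recursion is bounded
CASE_1 = residual_of_first(("cat", STAR_P, ("str", "A")))
# *P BREAK('A') BREAK('B')    : residual of *P is 0, the pattern loops
CASE_2 = residual_of_first(("cat", STAR_P, ("cat", BREAK, BREAK)))
# *P *Q                       : residual of *P is 1 by the assumption
CASE_3 = residual_of_first(("cat", STAR_P, ("star", "Q")))
```

A residual of at least 1 means each recursive plunge is given a strictly shorter LENGTH, which is what bounds the recursion by the length of the subject.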
Exercise 7.1  Implement the BREAK(S) primitive (call it
BREAKP) in SNOBOL4 source in a manner
similar to the way in which the LEN(N) primitive (called
'LENP') was implemented in the text. Assume that ANY(S) and
POS(N) are available.


Exercise 7.2  There is a single pattern primitive called
CHARP which is used in matching any string
against the subject. The string is contained in ARG(NODE)
while PROG(NODE) contains CHARP. Assuming SUBSTR (Prog. 3.9)
is available, show how CHARP could be implemented in SNOBOL4
source. Pass control to LF or MF on failure depending on
whether or not the pattern is futile.


Exercise 7.3  After executing the instructions below, (a)
how many s-vacancies will there be in P? (b)
how many a-vacancies? Express your answer in terms of N.


          P = 'A'
          I = 0
LOOP      P = (P | P) (P | P)
          I = I + 1 LT(I,N)                          :S(LOOP)

Exercise 7.4  As indicated in the text, to properly
concatenate two patterns, the first must be
copied. Assuming the patterns are linked structures as
indicated in the function CONCAT, implement CONCAT as a modified
form of COPYL (Prog. 5.8).


Exercise 7.5  A path diagram is well-formed if (1) any
sequence of alternates ends in an a-vacancy
(i.e. no loop of alternates exists) and (2) no loop of
subsequents exists. Show that any path diagram formed by
alternating, concatenating or ARBNO'ing (see Figure 7.7)
well-formed (and distinct) path diagrams produces only well-formed
path diagrams.
Exercise 7.6  One implementation of patterns encodes them
as a contiguous set of nodes together with a
header to form one large array as shown in Figure 7.10.


The root node is always node 1. The MIN field is the minimum 
length string that the pattern will match. FLAG and START are 
used as the anchoring field. If FLAG is 1 and START contains 
N, then the pattern is anchored in the form POS(N) ... If FLAG 
is -1 then the pattern is anchored in the form RPOS(N) ... If 
FLAG is 0, no special anchoring heuristic exists. 


The alternate and subsequent fields contain the subscript of 
the target nodes. If empty, these fields contain some nonposi- 
tive integers. 
[Diagram omitted: one array consisting of the header, then
node 1, then all other nodes]


Figure 7.10


The data structure for a practical implementation
of patterns.


Write a subroutine to build (a) the alternation and (b) the 
concatenation of two patterns and (c) find the ARBNO of one 
pattern. 


Exercise 7.7  How many primitive matches (successful and
unsuccessful) are involved in the following
pattern matching statements?


(a) 'ABCDEFGHIJKLMN' 'EF' | 'C' 

(b) DUPL('A',20) 'B* AN 

(c) DUPL('A*,20) QN 'B! 

(d) 'AABAAACE' ('C' | 'D') ('E' | 'F')
(e) DUPL('A',20) SPAN('A') | BREAK('A')


(f) 'AABAAC' SPAN('A') 'C'


Exercise 7.8  Write the MATCH function which serves to
drive the scanner. Be sure to set and test
the futility flag (FUTILITY) if &FULLSCAN is off and check
&ANCHOR. MATCH will have two arguments, the subject S and the
pattern P. Have MATCH fail if the pattern fails and return
the string matched if it succeeds. Be sure to indicate which
variables are temporary.


Exercise 7.9  Which of the following monic patterns are
transmitters of futility?


(a) SPAN('AB') | NOTANY('AB')

(b) TAB(N) | POS(N + 1)

(c) 'ARA' | 'B'

(d) 'ABCD' | 'DCBA'

Exercise 7.10  Which of the following patterns are contextually
anchored and what is the character
set in each case?


(ANY('AB') | SPAN('DE') | 'CAT') LEN(3)
POS(3) BREAK('AB')

('A' | (SPAN('B') | 'CAN'))
ARBNO(ANY('AB'))
Exercise 7.11  If the subsequent P to the pattern TAB(N)
fails (even if the failure is match-failure)
one may presume that TAB(N) P is futile and no
increase in cursor position can help. How would we implement
TAB(N) to take advantage of this?


Exercise 7.12  If a user requires that BAL match the null
string he may very easily create a pattern
which will provide this extension. He may write:


NEW_BAL = NULL | BAL 
(a) Draw the resulting path diagram. 


(b) Design a compound for implementing NULL | BAL directly 
(using GBAL of course). 
Exercise 7.13  In QUICKSCAN mode, if the subsequent to
ARBNO(P) is futile, no further extensions
need be taken provided P cannot match a string of negative
length. The compound shown in Figure 7.11 below is designed
to implement this heuristic. Describe the operation of the
primitives ARBN1 and ARBN2 in SNOBOL4 source, i.e. in a manner
similar to the descriptions of ARB1 and ARB2.


Figure 7.11


A path diagram to implement a futility heuristic 
for ARBNO. 
Exercise 7.14  Design a compound for implementing BREAKX()
(the SPITBOL function, see Prog. 8.2)
assuming that the BREAK primitive and LEN(1) are available.


Exercise 7.15  Describe how you would implement the pattern
NOT(P) defined as matching the null
string if P fails, failing if P succeeds, and aborting if P
aborts.


Exercise 7.16  In chapter 6, ARBNO(P) was defined as


          ARBNO(P) = NULL | P ARBNO(P)


Show that the derived pattern of the path diagram in Figure
7.7 is


          ARBNO(P) D(s) | D(a)


where P is the derived pattern of the path diagram p. You may
assume in your proof that P does not match the null string.


Exercise 7.17  The scanner function operates in such a
manner that the pattern implemented is the
derived pattern:


          p D(s) | D(a)


Rewrite SCAN so that the derived pattern is:


          D(a) | p D(s)
Exercise 7.18  Rewrite SCAN to implement the derived
pattern


          (p | D(a)) D(s)


(Hint: study STAR and RESTAR carefully and do not
underestimate this problem.)
Exercise 7.19  To eliminate one of the nil nodes of Figure
7.4, it is proposed that the alternate be
'hung off' the LEN(1) node, eliminating the first nil
entirely. Show that the derived path diagram of this combination
does not equal


          ARB D(s) | D(a)


as it should.
Exercise 7.20  Assume that a flag exists called UEFLAG
which is set by STAR to indicate that an
unevaluated expression was encountered. Modify ARB so that
the length fail heuristic is unobtrusive but so that ARB
reports length fail if there are no unevaluated expressions
encountered in the subsequent to ARB.
Patterns are data objects and, as such, enjoy the same
rights and privileges bestowed on objects having the
more conventional typings of STRING, INTEGER and REAL.
In particular, patterns may be assigned to variables
(possibly array elements or field variables) and may be
passed to and from functions. This chapter demonstrates
these capabilities, describes a number of useful
(and not-so-useful) pattern-valued functions, and
provides a few very practical patterns for analyzing common
linguistic cases.


A word perhaps should be said about the virtue of attempting
to solve as much of the problem as possible with one big pattern
match. This can obviously be overdone. For example:


          S (REM $ OUTPUT FAIL | LEN(1) . T REM . S)


serves to both print the string S and separate it from its 
first character. This has the same effect as: 


          OUTPUT = S
          S LEN(1) . T REM . S


The two-line version is clearer and, if anything, more ef- 
ficient and is easier to type and modify. The one-line version 
might perhaps be written to be cute or perhaps in the mistaken 
belief that statement overhead is significant (it is not). 


There are, however, often excellent reasons for using one pat- 
tern match as opposed to two or more. Consider looking for a 
quoted literal while analyzing SNOBOL4 source. Assume S con- 
tains a valid SNOBOLU statement and assume we wish to search 
for the existence of a quoted literal assigning it to the 
variable X and transferring to NONE if none exists. One poor 
attempt is: 


          Q = "'"
          QQ = '"'
          S (Q BREAK(Q) Q) . X                       :S(AROUND)
          S (QQ BREAK(QQ) QQ) . X                    :F(NONE)
AROUND


If the two pattern matches are replaced by:


          S (Q BREAK(Q) Q | QQ BREAK(QQ) QQ) . X     :F(NONE)


the result is not necessarily clearer or more efficient but does
have the beneficial property of not being wrong. If the string
S contained


then the two-pattern case would have erred. 
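
The difference between the two approaches can be reproduced with regular expressions. The Python sketch below is an analogy rather than SNOBOL4: the regular expressions stand in for Q BREAK(Q) Q and QQ BREAK(QQ) QQ, the function names are hypothetical, and the sample statement is an assumption chosen for illustration (it is not the string from the original text).

```python
import re

SINGLE = re.compile(r"'[^']*'")          # stands in for Q BREAK(Q) Q
DOUBLE = re.compile(r'"[^"]*"')          # stands in for QQ BREAK(QQ) QQ
EITHER = re.compile(r"'[^']*'" + "|" + r'"[^"]*"')


def two_step(s):
    """Search the whole string for a single-quoted literal first and
    only then for a double-quoted one (the 'poor attempt')."""
    m = SINGLE.search(s) or DOUBLE.search(s)
    return m.group(0) if m else None


def one_step(s):
    """Try both kinds of literal at each cursor position, left to right."""
    m = EITHER.search(s)
    return m.group(0) if m else None


# Hypothetical statement: an apostrophe inside a double-quoted literal.
SAMPLE = "OUTPUT = \"IT'S\" 'A'"
```

With this sample the two-step search pairs the apostrophe inside the first literal with the opening quote of the second and so reports a string that is not a literal at all, whereas the single alternation finds the double-quoted literal that actually comes first.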




There are times when a single large pattern can take the place
of many lines of code. I have seen a case where a programmer
wrote a machine-language subroutine (to be called from
SNOBOL4) to parse the 360 assembler language where this parse
can be written as one not-too-complex pattern (ASM360, Program
8.11). The reason I saw it at all was because the program
became a hopeless jumble and the writer of the program was
virtually lost in a sea of complexity. The mistake made here
was to assume that because, in assembly language, each step is
quite clear, the composition of an arbitrary number of
such steps should also be clear. Programming offers no more
vivid testimony than to deny this assumption.


Program 8.1  BRKREM

There are cases when it is desirable that the
pattern BREAK(S) match the entire string if
(and only if) there are no break characters
found. If it were not for the 'only if'
proviso, the pattern


          BREAK(S) | REM


would do. But this pattern has the potential to match 2
strings; i.e. it is not monic.


*  BRKREM(S) returns a pattern that will behave like BREAK(S)
*  if that pattern would succeed and will match the remainder
*  of the subject string otherwise.
*
          DEFINE('BRKREM(S)CS')                      :(BRKREM_END)
*
*  If S is null there are no break characters.  Return a
*  pattern which will consume the rest of the string.
*
BRKREM    BRKREM = IDENT(S) REM                      :S(RETURN)
*
*  Find the set complement (CS) of S.  If this is null, BRKREM
*  should match the null string.
*
          CS = DIFF(&ALPHABET,S)
          IDENT(CS)                                  :S(RETURN)
*
*  Otherwise return the alternation of 3 mutually exclusive
*  cases.
*
          BRKREM = RPOS(0) | SPAN(CS) RPOS(0) | BREAK(S)
+                                                    :(RETURN)
BRKREM_END


Names referenced     Name     Type       Where defined
by BRKREM:           DIFF     Function   Program 3.10
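
The distinction between the monic BRKREM(S) and the two-valued BREAK(S) | REM can be seen in a small model. The Python sketch below is an illustration under assumptions: a match is represented only by the cursor position at which it ends, and the function names brkrem and break_or_rem are hypothetical.

```python
def brkrem(subject, cursor, breaks):
    """Return the end index of the single match of BRKREM(breaks)
    starting at cursor."""
    for i in range(cursor, len(subject)):
        if subject[i] in breaks:
            return i              # behaves like BREAK(S)
    return len(subject)           # no break character: take the remainder


def break_or_rem(subject, cursor, breaks):
    """Yield every end index that BREAK(S) | REM can produce at cursor,
    in the order the scanner would try them."""
    for i in range(cursor, len(subject)):
        if subject[i] in breaks:
            yield i               # the BREAK(S) alternative
            break
    yield len(subject)            # the REM alternative, always available
```

On the subject 'AB,CD' with the break set ',' the first function gives only position 2, while the second offers 2 and then, on backup, 5; that second offering is what makes BREAK(S) | REM unsuitable when the 'only if' proviso matters.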
Program 8.2  BREAKX

The pattern BREAK(S) where S is a string
will rapidly scan for one of the characters
in S, stopping just short of the found
character. The scanning is done as fast as
the hardware will allow and, for 360 implementations, this is
quite rapid. But suppose the problem is not to scan for a
character but for a string S. This can be done quite easily
by the statement


SUBJECT S 


To speed up the search, we might think of using BREAK to scan
for the initial character of S as follows


          S LEN(1) . INITIAL
          SUBJECT POS(0) BREAK(INITIAL) S


This will succeed if S appears at the first instance of its
initial character. Otherwise the pattern would fail since 
BREAK cannot match a string containing INITIAL. If we were to 
remove the POS(0) the pattern would 'work' in the sense that 
it would succeed when required but the time required to do so 
could be worse than before. This is because the scanner would 
increment the cursor by 1 after each failure and thereby move 
quite slowly toward its destination. To fix the situation we 
define a function called BREAKX (BREAK eXtended) which, upon 
failing, will extend past the break character to find another. 
Like BAL and ARBNO, BREAKX is said to have implicit 
alternatives. 


BREAKX was first introduced as a built-in function in SPITBOL 
and appears in SITBOL and FASBOL. 


          DEFINE('BREAKX(S)')                        :(BREAKX_END)
BREAKX    BREAKX = BREAK(S) ARBNO(LEN(1) BREAK(S))   :(RETURN)
BREAKX_END
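
The implicit alternatives of BREAKX can be pictured as a sequence of stopping points, one before each break character. The Python sketch below models this with a generator; it is an illustration, not the SPITBOL built-in, and the names breakx and find are hypothetical.

```python
def breakx(subject, cursor, breaks):
    """Yield, in order, each cursor position at which
    BREAK(S) ARBNO(LEN(1) BREAK(S)) can stop when started at cursor."""
    for i in range(cursor, len(subject)):
        if subject[i] in breaks:
            yield i


def find(subject, s):
    """Locate s in subject in the manner of
    POS(0) BREAKX(initial) s; return its index, or -1 if absent."""
    if not s:
        return 0
    for i in breakx(subject, 0, s[0]):
        if subject.startswith(s, i):
            return i
    return -1
```

Each failure of the string comparison plays the part of the scanner backing up into BREAKX, which then extends to the next occurrence of the initial character instead of advancing the cursor by one.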
Program 8.3  BAL

In analyzing programs BAL can be quite useful
but it is also limited in that it cannot
be applied freely to expressions which permit
quote marks. For example, even though
the string


          "ABC (DEF '(' GHI) JKL"


is balanced in the syntax of SNOBOL4, BAL would not match it.
Since most languages have the capability of permitting quoted
expressions within an expression, this severely hinders the
application of BAL.


Analyzing languages which have bracketing other than, or in 
addition to, parenthesization also presents a situation in 
which BAL is inadequate. For example, suppose that a list of 
arguments (expressions separated by commas) is contained in
the string LIST and suppose that its initial left parenthesis 
were removed. For example 


LIST = '13, A + B(3,4), C)' 


In order to pick off arguments from such a list, we may think 
of using the pattern matching statement: 


LIST POS(0) BAL . ARG ANY(',)') = 


Aside from the problem of quoted literals this statement will 
work correctly only if the source language contains no other 
kind of bracketing. For example, if the source language were 
SNOBOL4 and if LIST contained:


          LIST = '13, A + B<3,4>, C)'


the pattern matching statement described above would find
' A + B<3' as second argument which of course is incorrect.


The function BAL(PARENS,QTS) will return a pattern which will 
match all nonnull balanced strings where the first argument is 
used to specify paired brackets in nested fashion and the 
second argument specifies characters used as quotes. For example,
BAL('(<>)', '"' "'") will match a balanced string in
SNOBOL4 source. Also BAL('()') is equivalent to the built-in
pattern BAL. 


Let us consider how we might define the built-in pattern BAL 
if it did not exist before proceeding to the more general 
case. BAL is a pattern which will match any string balanced 
with respect to parentheses. A balanced string is defined as


1. Any single character not a parenthesis is balanced.
2. If B is balanced or is null then '(' B ')' is balanced.
3. If B1 and B2 are balanced, then B1 B2 is balanced.


A straightforward translation of this definition could be used 
to define BAL and it would have the appearance: 


          BAL = NOTANY(')(') | '(' (*BAL | NULL) ')' | *BAL *BAL


The difficulty with this rendition of BAL is twofold. It uses 
the stack heavily (even when there are no parentheses in the 
subject) and it is inefficient especially if it is headed for 
failure. The difficulty in both cases is the third alterna- 
tive. As discussed in the previous chapter, there are two 
kinds of stack usage that we must be concerned with. There are
the relatively mild requirements of the alternatives which
must be placed on the history stack; then there are the more
severe requirements of recursion. This version of BAL uses 
the recursion stack quite heavily. Consider the match 


          '(XXX ... X)'  '(' BAL ')'


where there are N X's in the subject string. The maximum 
recursive level is N-1. What's worse, if the pattern BAL does 
not succeed as in 


          '(XXX ... X'  '(' BAL ')'


the time required rises exponentially with the length of the 
subject. 


Another approach to encoding BAL is as follows: let GBAL match
only the first balanced string (as opposed to all balanced
strings). Then express BAL in terms of GBAL.


          GBAL = NOTANY(')(') | '(' (*BAL | NULL) ')'
          BAL  = GBAL ARBNO(GBAL)


This reduces BAL to sequential application of GBAL's and the 
time to determine failure does not rise exponentially. There 
is still the problem that the amount of stack used rises
linearly with the length of the subject, though this time
the stack used is the history stack and not the recursion
stack. An alternate-cursor pair is laid down at each nonparenthesis
scanned in the subject string. As this may be disturbing
for large strings, a better tactic is to reverse the order
of alternation in defining GBAL as follows:


          GBAL = '(' (*BAL | NULL) ')' | NOTANY(')(')


There is a time-storage tradeoff here. While this version of 
GBAL consumes less stack, it requires slightly more time in 
the event that the pattern is to succeed. We will opt for 
reduced stack usage. 


Another problem associated with writing the BAL function is 
how do we return a recursively defined pattern from a func- 
tion. Consider the function F(P) which attempts to return a 
pattern to match a sequence of P's. 


          DEFINE('F(P)')                             :(F_END)
F         F = P *F | NULL                            :(RETURN)
F_END


F returns a pattern whose definition depends on the current
value of F. But Lord knows what the value of F is after the
return. It can be anything, since the old value of F is
restored. Moreover, even if a global name were used, the name
would be reassigned a new value each call. A way to avoid
these problems is to create a unique name at each call. Assume
for the sake of argument that F1876 is such a unique name.
Then if


F1876 = P *F1876 | NULL 
F = F1876 : (RETURN) 
were executed, the desired value would be returned. Code such
as this could be created dynamically via the CODE function. A 
more efficient technique is to convert the unique name to 
EXPRESSION. This is done in defining BAL. 


          DEFINE('BAL(PARENS,QTS)Q,GBAL,NAME,STAR,LP,RP')
+                                                    :(BAL_END)
*
*  Entry point: Create a unique but uncommon name (NAME) for
*  a variable which is to be assigned the pattern.  To use it
*  recursively, we will need the associated unevaluated
*  expression (STAR).  Also initialize GBAL.
*
BAL       NAME = 'BAL.' &STCOUNT
          STAR = CONVERT(NAME,'EXPRESSION')
          GBAL = NOTANY(PARENS QTS)
*
*  Loop on quote characters inserting a quoted literal as an
*  optional candidate for a balanced string.
*
BAL_1     QTS LEN(1) . Q =                           :F(BAL_2)
          GBAL = Q BREAK(Q) Q | GBAL                 :(BAL_1)
*
*  Loop on the nested bracketing characters and create a
*  balanced alternate for each pair.
*
BAL_2     PARENS LEN(1) . LP RTAB(1) . PARENS LEN(1) . RP
+                                                    :F(BAL_3)
          GBAL = LP (STAR | NULL) RP | GBAL          :(BAL_2)
*
*  Define BAL (the returned string) in terms of GBAL and
*  assign it to the strangely named variable so that recursion
*  works.
*
BAL_3     BAL = GBAL ARBNO(GBAL)
          $NAME = BAL                                :(RETURN)
BAL_END
Epilogue 


Note that the name of the function is the same as the name of 
a built-in pattern BAL. Both the variable and the function 
can co-exist and can be entirely unrelated. Note that when 
the function is called the variable BAL is temporarily assigned
a null value and is subsequently assigned the return
value. Upon return, the original value of BAL is restored so 
no difficulty ensues. 
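
The behaviour of the pattern returned by BAL(PARENS,QTS) can be modelled with a short recursive routine. The Python sketch below is an illustration under assumptions: the names make_pairs, gbal and bal are hypothetical, a match is represented by the cursor position at which it ends, and bracket and quote characters are assumed to be distinct from one another.

```python
def make_pairs(parens):
    """'(<>)' pairs the first character with the last, the second
    with the next-to-last, and so on."""
    n = len(parens)
    return {parens[k]: parens[n - 1 - k] for k in range(n // 2)}


def gbal(s, i, pairs, quotes):
    """Return the index just past one balanced element starting at
    s[i], or None if no element starts there."""
    if i >= len(s):
        return None
    c = s[i]
    if c in quotes:                       # quoted literal: Q BREAK(Q) Q
        j = s.find(c, i + 1)
        return None if j < 0 else j + 1
    if c in pairs:                        # LP (balanced | NULL) RP
        j = i + 1
        while True:
            k = gbal(s, j, pairs, quotes)
            if k is None:
                break
            j = k
        if j < len(s) and s[j] == pairs[c]:
            return j + 1
        return None
    if c in pairs.values():               # an unmatched right bracket
        return None
    return i + 1                          # any other single character


def bal(s, i, parens="()", quotes=""):
    """Yield the end index of each successively longer balanced
    string starting at i (the GBAL ARBNO(GBAL) behaviour)."""
    pairs = make_pairs(parens)
    while True:
        i = gbal(s, i, pairs, quotes)
        if i is None:
            return
        yield i
```

Applied to the string ABC (DEF '(' GHI) JKL with quotes declared, the generator eventually reaches the end of the string; with no quotes declared, as for the built-in BAL, it stops after the blank that precedes the first parenthesis.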


Program 8.4  FASTBAL

A criticism that could be leveled against
the BAL function is that the pattern it
returns creeps along, one character at a
time, at speeds determined by
ARBNO(NOTANY()). A much faster version can be written which
will skip over uninteresting characters at BREAK speeds and
stop only before parens, quoted literals and any of a set of
designated characters provided as a third argument. For
example


          SNOARG = FASTBAL('(<>)', '"' "'", ',)') . ARG ANY(',)')


will assign to SNOARG a pattern which can be used to scan for 
the arguments of a function call in SNOBOLU source. If the 
string to be scanned is 


          A 'B' + F(')'), X)


then SNOARG will tentatively match "A " and then "A 'B' + F"
before finally matching "A 'B' + F(')')". FASTBAL, like
BREAKX, will continue to take extensions. For example, the 
pattern match 


          'A/B(/D)/D' POS(0) FASTBAL('()',,'/') '/D'

will succeed with the entire subject being matched. 


Like BREAKX and unlike BAL, FASTBAL will not match the entire 
string since it requires a break character. Such a modifica- 
tion, however, is easily made and is explored in an exercise. 


          DEFINE('FASTBAL(PARENS,QTS,S)NAME,IBAL,SPCHARS,ELEM'
+           ',LPS,Q,LP,RP')                          :(FASTBAL_END)
*
*  Entry point: NAME is a uniquely created name for the
*  variable that will eventually hold the returned pattern.
*  IBAL is a pattern to match balanced strings on the
*  interior of brackets.
*
FASTBAL   NAME = 'FASTBAL_' &STCOUNT
          IBAL = CONVERT(NAME,'EXPRESSION')
          IBAL = DIFFER(S,NULL) FASTBAL(PARENS,QTS)
*
*  SPCHARS are all the special characters.  ELEM is a monic
*  pattern to match a balanced string to be built up during
*  the subsequent computation.
*
          SPCHARS = PARENS QTS S
          ELEM = NOTANY(PARENS QTS) BREAK(SPCHARS)
*
*  Loop on quotes, oring in a quoted literal pattern for
*  every quote.
*
FASTBAL_1 QTS LEN(1) . Q =                           :F(FASTBAL_2)
          ELEM = Q BREAK(Q) Q | ELEM                 :(FASTBAL_1)
*
*  Loop on parens, oring in a balanced form for each pair.
*
FASTBAL_2 PARENS LEN(1) . LP RTAB(1) . PARENS
+           LEN(1) . RP                              :F(FASTBAL_3)
          ELEM = LP IBAL RP | ELEM                   :(FASTBAL_2)
*
*  Wrap things up and return.
*
FASTBAL_3 FASTBAL = BREAK(SPCHARS) ARBNO(ELEM)
          $NAME = FASTBAL                            :(RETURN)
FASTBAL_END


Program 8.5  NOT

The function NOT(P) returns a pattern which
will match the null string provided P would
fail and will fail if P would succeed.
NOT(P) is undefined if P is nonlinear. As
an example of the use of NOT assume we wish to write a pattern
which will match a PL/I comment. The pattern '/*' ARB '*/'
will not do since it will match other things in addition to
comments. For example it will match three strings in the PL/I
statement below where only two are comments.


GOUT /* GARBAGE OUT */ = GIN /* GARBAGE IN */ 

To match a comment we can write:


          '/*' ARBNO(NOT('*/') LEN(1)) '*/'


Here the ARB is replaced by a pattern constructed from ARBNO
which will match an arbitrary string not containing the
substring '*/'. To speed up the search for the closing '*/' we
can employ BREAK as follows:


          '/*' ARBNO(NOT('*/') LEN(1) BREAK('*')) '*/'

The function NOT is so constructed as to be embeddable in itself.
Thus NOT(NOT(P)) will match the null string if P would
succeed. Also if C were the comment matcher defined above,
NOT(C) would operate correctly.
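
Readers familiar with regular expressions may recognize NOT(P) as a negative lookahead. The Python sketch below is an analogy, not a SNOBOL4 program: arb_matches is a hypothetical helper that enumerates everything '/*' ARB '*/' could match, and COMMENT is the regular-expression counterpart of '/*' ARBNO(NOT('*/') LEN(1)) '*/'.

```python
import re

STATEMENT = "GOUT /* GARBAGE OUT */ = GIN /* GARBAGE IN */"


def arb_matches(s):
    """Every substring that starts with '/*' and ends with a later
    '*/', i.e. everything '/*' ARB '*/' is able to match."""
    out = []
    for i in range(len(s)):
        if s.startswith("/*", i):
            for j in range(i + 2, len(s) - 1):
                if s.startswith("*/", j):
                    out.append(s[i:j + 2])
    return out


# (?!\*/) succeeds, consuming nothing, only where '*/' does not follow;
# this is the role played by NOT('*/') in the SNOBOL4 pattern.
COMMENT = re.compile(r"/\*(?:(?!\*/).)*\*/", re.DOTALL)

ALL_ARB = arb_matches(STATEMENT)
COMMENTS = COMMENT.findall(STATEMENT)
```

ALL_ARB holds three strings, one of which runs from the first opener to the last closer, while COMMENTS holds only the two genuine comments.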


One drawback of NOT, which is the reason we will not use it
more widely in building other patterns, is that it must be
used in FULLSCAN mode. The reason for this is the one-character
assumption of the recursive reduction heuristic
described in the previous chapter. Since mode switching is
generally poor programming practice, we will generally avoid
the use of NOT.


+------------------------------------------------------------+
| NOT(P) will return a pattern which will match the null |
| string if P fails and fail if P matches. If P aborts, |
| NOT(P) will also abort. |
+------------------------------------------------------------+
         DEFINE('NOT(P)')                        :(NOT_END)
+------------------------------------------------------------+
| Entry point: Return a pattern which pushes null onto the |
| stack and replaces it with nonnull only if the pattern |
| succeeds. The flag is eventually popped and tested by the |
| alternative. |
+------------------------------------------------------------+
NOT      NOT  =  *PUSH() P *(POP() PUSH(1)) FAIL |
+               *IDENT(POP())                    :(RETURN)
NOT_END

Names referenced    Name    Type        Where defined
by NOT:             PUSH    Function    Program 5.5
                    POP     Function    Program 5.6

Epilogue


P is assumed not to have side effects which will alter the
stack. For example, if

     P  =  NULL | *(POP() PUSH()) FAIL

then P will cleverly undo what NOT was trying to do and cause
NOT(P) to succeed where it should always fail. But this
amounts to almost deliberate meddling. If P uses the stack
normally (i.e. leaving its state the way it was found) then
NOT will operate correctly.


Program 8.6 - ONCE

ONCE() returns a pattern which will succeed once and only once
and thereafter fail forever. For example, the pattern matching
statement

     'AAAB'  'A' ONCE() 'B' | 'B'

will result in the 'B' being matched, but not the 'AB', since
the first time through the left alternation, 'B' failed,
indicating that that path could no longer be taken. Note that
ONCE() must return a new and distinct pattern on each call
since once it is used it can never be reused.


ONCE() is similar to FENCE in that it matches the null string 
initially. Unlike FENCE, however, failure in subsequent tries 
is like FAIL (as opposed to ABORT) which permits other alter- 
nates to be taken. 
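
The effect on the example above can be traced with a small
model. This is a sketch in Python, under the assumption that a
pattern is a function yielding the post-cursor positions it can
reach; lit, seq, alt and scan are hypothetical helpers, and
once() keeps its state in a closure instead of the uniquely
named variable used by the program below.

```python
def lit(s):
    """Pattern matching the literal string s."""
    def p(subj, pos):
        if subj.startswith(s, pos):
            yield pos + len(s)
    return p


def seq(*parts):
    """Concatenation with backtracking."""
    def p(subj, pos):
        if not parts:
            yield pos
            return
        for mid in parts[0](subj, pos):
            yield from seq(*parts[1:])(subj, mid)
    return p


def alt(*choices):
    """Alternation: try each choice in order."""
    def p(subj, pos):
        for choice in choices:
            yield from choice(subj, pos)
    return p


def once():
    """Match the null string the first time reached, then fail forever."""
    used = False

    def p(subj, pos):
        nonlocal used
        if not used:
            used = True
            yield pos
    return p


def scan(subj, pat):
    """Unanchored match: (start, end) of the first success, else None."""
    for start in range(len(subj) + 1):
        for end in pat(subj, start):
            return start, end
    return None


with_once = alt(seq(lit("A"), once(), lit("B")), lit("B"))
without = alt(seq(lit("A"), lit("B")), lit("B"))
print(scan("AAAB", with_once))   # only the final 'B' is matched
print(scan("AAAB", without))     # the 'AB' is matched
```

Each call of once() builds fresh state, which mirrors the
requirement that ONCE() return a new and distinct pattern on
each call.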


+------------------------------------------------------------+
| ONCE() will return a pattern that will succeed just once. |
+------------------------------------------------------------+
         DEFINE('ONCE(ID)NAME')                  :(ONCE_END)
+------------------------------------------------------------+
| Entry point: If the argument is null we return a new pat- |
| tern equal to *ONCE(id) where id is a unique integer. |
+------------------------------------------------------------+
ONCE     ONCE  =  IDENT(ID,NULL)
+          CONVERT('ONCE(' &STCOUNT ')' , 'EXPRESSION')  :S(RETURN)
+------------------------------------------------------------+
| Otherwise compute a name based on the unique ID. Return |
| its value. It will be initially null. Set it to FAIL for |
| all subsequent calls. |
+------------------------------------------------------------+
         NAME  =  'ONCE..' ID
         ONCE  =  $NAME
         $NAME  =  FAIL                          :(RETURN)
ONCE_END
Epilogue

The function ONCE() returns an expression of the form *ONCE(n)
which will succeed just once and fail forever after. It
illustrates several principles. First, a function can return
different patterns and each of these patterns can vary its
own behavior with time. Second, the function serves both to
return a pattern initially and as the function invoked
during the match. Both of these operating principles will be
in use in the next function.

The technique used to encode ONCE() can be used to pick off
the first match of a pattern and thereby increase efficiency.
See Exercise 8.8.


Program 8.7 - TEST

TEST is designed to alleviate some of the problems involved
with the one-character assumption which we have already
indicated might be a source of difficulty with the NOT
function. TEST will accept an unevaluated expression as
argument and return a pattern. When the pattern is encountered
by the scanner during a pattern match the original unevaluated
expression will be EVALed and the pattern will succeed or fail
depending on the outcome of the EVAL. If it succeeds it
matches the null string. For example

     TEST(*LGT(A,B))

will return a pattern which, during pattern matching, will
succeed or fail depending on whether A is, or is not,
lexically greater than B.


Thus TEST(exp) acts like exp. It differs from exp in that its 
minimum length will be 0 as opposed to 1 and it will match the 
null string if the evaluation succeeds. 


         DEFINE('TEST(ARG)NAME')                 :(TEST_END)
+------------------------------------------------------------+
| Entry point: If ARG is an EXPRESSION we will return a |
| pattern. The expression is saved in a unique name (NAME) |
| and this name, in the form of a string, is used as an ar- |
| gument on subsequent calls to TEST. |
+------------------------------------------------------------+
TEST     IDENT(DATATYPE(ARG),'EXPRESSION')       :F(TEST_1)
         NAME  =  'TEST_' &STCOUNT




         $NAME  =  ARG
         TEST  =  EVAL("NULL $ *TEST('" NAME "')")   :(RETURN)
+------------------------------------------------------------+
| If ARG is not an EXPRESSION we presume that we are dealing |
| with one of those subsequent calls to TEST. In fact, we |
| can conclude that we're in the middle of a pattern match. |
| Retrieve the old expression and evaluate it and return a |
| dummy name. |
+------------------------------------------------------------+
TEST_1   TEST  =  ?EVAL($ARG) .TEST_             :S(NRETURN)F(FRETURN)
TEST_END

Program 8.8 - LIKE

LIKE(S) returns a pattern that will match a string like the
one passed as argument. A like string is defined as any one
differing from the argument by a) a rearrangement of two
characters, b) the deletion of a character or c) the
insertion of a character.
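
The definition can be restated as a direct check. Below is a
sketch in Python, assuming (as the program that follows does)
that a rearrangement means exchanging two adjacent characters.
The name is_like is hypothetical, and the function compares
whole strings instead of building a pattern of alternatives.

```python
def is_like(s, t):
    """True if t equals s or differs by one insertion, deletion or
    exchange of two adjacent characters."""
    if s == t:
        return True
    for n in range(len(s) + 1):
        t1, t2 = s[:n], s[n:]          # split s at cursor position n
        # one character inserted at position n
        if len(t) == len(s) + 1 and t[:n] == t1 and t[n + 1:] == t2:
            return True
        # the character at position n deleted
        if t2 and t == t1 + t2[1:]:
            return True
        # the two characters at position n exchanged
        if len(t2) >= 2 and t == t1 + t2[1] + t2[0] + t2[2:]:
            return True
    return False


for candidate in ["ABC", "ACB", "AC", "ABXC", "CBA"]:
    print(candidate, is_like("ABC", candidate))
```

The loop over n plays the role of the cursor position N in the
program: each position contributes one insertion form, one
deletion form and one exchange form.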


         DEFINE('LIKE(S)C,T1,T2,N')              :(LIKE_END)
+------------------------------------------------------------+
| Entry point: Make sure that S itself is regarded as LIKE |
| S. |
+------------------------------------------------------------+
LIKE     LIKE  =  S
+------------------------------------------------------------+
| Loop on N where N denotes a cursor position within S. |
| Split S into two parts, T1 and T2. |
+------------------------------------------------------------+
LIKE_1   S  TAB(N) . T1  REM . T2                :F(RETURN)
         N  =  N + 1
+------------------------------------------------------------+
| First OR in a pattern which matches S with one character |
| inserted at position N. |
+------------------------------------------------------------+
         LIKE  =  LIKE | T1 LEN(1) T2
+------------------------------------------------------------+
| Then OR in the pattern which matches with one character |
| deleted at position N. |
+------------------------------------------------------------+
         T2  LEN(1) . C  =                       :F(RETURN)
         LIKE  =  LIKE | T1 T2
+------------------------------------------------------------+
| Then OR in the pattern where the two characters at posi- |
| tion N have been rearranged. |
+------------------------------------------------------------+
         T2  POS(1)  =  C                        :F(LIKE_1)
         LIKE  =  LIKE | T1 T2                   :(LIKE_1)
LIKE_END

Program 8.9 - OR

OR(S) is intended to form the OR (in the pattern sense) of
several strings contained in S. For example
OR(',ABC,DEF,XYZ') is equivalent to

     'ABC' | 'DEF' | 'XYZ'

The initial character (in this case a comma) is used to
separate elements. For efficiency purposes, OR will factor out
like initial characters. Thus

     OR(',ABLE,ACTOR,ANCHOR,BAKER,BULL')

is equivalent to

     'A' ('BLE' | 'CTOR' | 'NCHOR') | 'B' ('AKER' | 'ULL')

The resulting expression in this example is over twice as fast
as alternating 5 strings since for most subjects only 2 checks
are needed for every pre-cursor position as opposed to 5. The
initial character extraction is done to arbitrary levels so
that

     OR(',ABC,ABBOT,ACTOR,BAKER')

will return

     'A' ('B' ('C' | 'BOT') | 'CTOR') | 'BAKER'

For efficiency purposes, if a factored character contains only
one branch, the character is combined with the head of the
branch. Thus

     OR(',ABC,ABBOT,BAKER')

returns

     'AB' ('C' | 'BOT') | 'BAKER'

Characters in parentheses imply an ANY-like construction. Thus

     OR(',C(AO)D,C(AO)ST')

will return

     'C' ANY('AO') ('D' | 'ST')

Several examples of the use of OR are given in the
initialization section of HYPHENATE (Program 10.7).
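
The factoring rule can be sketched independently of SNOBOL4.
The Python function below is hypothetical: it takes a list of
strings instead of a delimited string, returns the factored
alternation as text, groups strings by first character
wherever they occur in the list, and leaves out the
parenthesized ANY notation. It is meant only to show how the
common prefix is extracted at each level.

```python
from os.path import commonprefix   # character-wise common prefix of strings


def factor(words):
    """Return the factored alternation of words as a display string."""
    words = list(dict.fromkeys(words))        # drop duplicates, keep order
    groups = {}
    for w in words:
        groups.setdefault(w[:1], []).append(w)   # group by first character
    alternatives = []
    for group in groups.values():
        if len(group) == 1:
            alternatives.append(repr(group[0]))
            continue
        common = commonprefix(group)          # longest prefix shared by all
        rest = [w[len(common):] for w in group]
        alternatives.append(f"{common!r} ({factor(rest)})")
    return " | ".join(alternatives)


print(factor(["ABLE", "ACTOR", "ANCHOR", "BAKER", "BULL"]))
print(factor(["ABC", "ABBOT", "ACTOR", "BAKER"]))
print(factor(["ABC", "ABBOT", "BAKER"]))
```

Taking the longest common prefix of a group in one step gives
the same effect as combining a single-branch character with
the head of its branch.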


+------------------------------------------------------------+
| OR(LIST) will return the alternation of the substrings of |
| LIST separated by the break character determined by the |
| first character in LIST. Parenthesized strings are |
| regarded as ANY. |
+------------------------------------------------------------+
         DEFINE('OR(LIST)BC,SEIZE,ANC')
+------------------------------------------------------------+
| OR_EXTRACT() is a function used by OR to extract from the |
| global variable LIST, the substrings beginning with the |
| same first character (or parenthesized expression). |
+------------------------------------------------------------+
         DEFINE('OR_EXTRACT()COMMON,IC,P,SUBLIST,T,TLIST,C1,C2')
+                                                :(OR_END)
+------------------------------------------------------------+
| Entry point for OR. Determine the break character and |
| define a pattern to be used throughout to SEIZE all up to |
| the next break character. Define ANC as a pattern to |
| anchor the scan and match the Break Character. |
+------------------------------------------------------------+
OR       LIST  LEN(1) . BC                       :F(FRETURN)
         SEIZE  =  BREAK(BC) | REM
         ANC  =  POS(0) BC
+------------------------------------------------------------+
| Or together all extractions. |
+------------------------------------------------------------+
         OR  =  OR_EXTRACT()
OR_LOOP  OR  =  OR | OR_EXTRACT()                :S(OR_LOOP)F(RETURN)
+------------------------------------------------------------+
| Entry point for OR_EXTRACT(): Set TLIST to be a copy of |
| LIST. Extract initial character (IC) and set COMMON equal |
| to the first substring. If this pattern fails, no IC could |
| be found. This means that LIST is either empty in which |
| case we fail, or contains only BC in which case we return |
| the null string. Both of these cases are important since |
| the former terminates the loop in OR() and the latter |
| breaks the recursion of OR_EXTRACT(). |
+------------------------------------------------------------+
OR_EXTRACT
         TLIST  =  LIST
         LIST  ANC (BAL . IC SEIZE) . COMMON     :S(ORX_1)
         IDENT(LIST,NULL)                        :S(FRETURN)
         LIST  =  NULL                           :(RETURN)
+------------------------------------------------------------+
| Find the largest COMMON prefix contained in all strings |
| beginning with IC. |
+------------------------------------------------------------+
ORX_1    TLIST  ANC IC                           :F(ORX_3)
ORX_2    TLIST  ANC COMMON SEIZE  =              :S(ORX_1)
+------------------------------------------------------------+
| COMMON was not there. Reduce COMMON by one character and |
| try again. This means extract the last balanced string of |
| COMMON. |
+------------------------------------------------------------+
         BALREV(COMMON)  BAL REM . COMMON        :F(ERROR)
         COMMON  =  BALREV(COMMON)               :(ORX_2)
+------------------------------------------------------------+
| Now remove the COMMON characters from each string as we |
| prepare a SUBLIST to be OR'ed. |
+------------------------------------------------------------+
ORX_3    LIST  ANC COMMON SEIZE . T  =           :F(ORX_4)
         SUBLIST  =  SUBLIST BC T                :(ORX_3)
+------------------------------------------------------------+
| Convert any parenthesized expression in COMMON to an ANY. |
| Build up the pattern in a temporary P. Then join this with |
| the result of a recursive call to OR. |
+------------------------------------------------------------+
ORX_4    COMMON  BREAK('(') . C1 '(' BREAK(')') . C2
+          ')'  =                                :F(ORX_5)
         P  =  P C1 ANY(C2)                      :(ORX_4)
ORX_5    OR_EXTRACT  =  P COMMON OR(SUBLIST)     :(RETURN)
OR_END
Names referenced    Name      Type        Where defined
by OR:              BALREV    Function    Program 3.8


Program 8.10 - PLI.STMT

This pattern is intended to match a PL/I statement (assigning
to STMT the string matched) and to fail if none exists. The
presumed scenario is that a program is reading lines of a PL/I
program and continues to apply the pattern until it succeeds
in matching a prefix of the combined input lines. The pattern
need not check for syntactic correctness of the input and
hence it will be sufficient to check for the presence of a
semicolon provided this character does not appear within
quotes or comments.
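
The rule stated above amounts to a single left-to-right pass
with no backing up. A sketch of that pass in Python follows;
the function name pli_stmt is hypothetical, and the sketch
takes an unterminated quote or comment to mean that no
statement is yet complete, which is an assumption about the
intended scenario and not something the text spells out.

```python
def pli_stmt(text):
    """Return the prefix of text up to and including the first semicolon
    outside quotes and comments, or None if there is none."""
    i, n = 0, len(text)
    while i < n:
        if text[i] == "'":                  # quoted literal
            close = text.find("'", i + 1)
            if close < 0:
                return None
            i = close + 1
        elif text.startswith("/*", i):      # comment
            close = text.find("*/", i + 2)
            if close < 0:
                return None
            i = close + 2
        elif text[i] == ";":                # statement terminator
            return text[:i + 1]
        else:                               # any other character
            i += 1
    return None


print(pli_stmt("A = 'X;Y'; B = 1;"))
print(pli_stmt("X /* a; b */ = 1"))
```

A caller would append further input lines to the text and try
again whenever None is returned.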


+------------------------------------------------------------+
| Define an ELEM as a quoted literal or a comment or a non- |
| null sequence containing neither a semicolon nor a comment |
| or quote delimiter. |
+------------------------------------------------------------+
         Q  =  "'"
         QLIT  =  Q FENCE BREAK(Q) Q
         CMNT  =  '/*' FENCE ARB '*/'
         ELEM  =  QLIT | CMNT | LEN(1) BREAK('/;' Q)
+------------------------------------------------------------+
| Use back-up-free scanning (Chapter 6) to search for the |
| statement. |
+------------------------------------------------------------+
         PLI.STMT  =  POS(0) (ARBNO(ELEM FENCE) ';') . STMT

Program 8.11 - ASM360

Many problems involving the processing of assembler source can
be conceptually simple and yet provide a challenge to the
programmer. Consider the problem of reformatting the source so
that various syntactic parts such as operations, operands and
comments are set to align at pre-determined card columns. The
heart of this problem as well as many others is simply the
extraction of the various fields since once these have been
obtained it is a relatively simple matter to recast a given
line in a new format. Different assembler languages offer
different problems to be solved. The OS assembler [IBM360b] is
noted for its relative ubiquity and complexity and will offer
a fine example to consider.

In the OS assembler there are four fields separated by blanks,
viz.

     NAME  OPERATION  OPERAND  COMMENT

where the optional NAME field must begin in column 1 if it
exists. One is tempted to use BREAK(' ') to separate the
fields. This works for the first two fields but the operand
field may have blanks embedded in quoted literals and so this
simple scheme will not do. Moreover, the quote that appears
in an expression beginning with L' is not to be considered for
quote-balancing. Thus

     L MVI 3,L'ABC 'THIS IS A COMMENT'

has an operand field (3rd field) that breaks after ABC and not
after THIS. The rule for determining whether L' is to be
considered specially is given on p. 71 of [IBM360b]:

     "An apostrophe not within a quoted string
     immediately followed by a letter and immediately
     preceded by the letter L (where L is preceded by
     any special character other than an ampersand) is
     not considered in determining paired apostrophes."

On page 10 of [IBM360b] we obtain the definitions of 'letter'
and 'special character' and so we begin coding ...


         LETTER  =  'ABCDEFGHIJKLMNOPQRSTUVWXYZ$#@'
         SP.CH   =  "+-,=.*()'/& "
+------------------------------------------------------------+
| From this we obtain 'special character other than |
| ampersand' which we will call SCOTA. |
+------------------------------------------------------------+
         SCOTA  =  SP.CH
         SCOTA  '&'  =
+------------------------------------------------------------+
| We consider the line decomposed into disjoint elements |
| where each element is either (in order) a quoted literal, |
| an L' construct, a single SCOTA or a sequence of |
| non-SCOTA's. |
+------------------------------------------------------------+
         Q  =  "'"
         QLIT  =  Q FENCE BREAK(Q) Q
         ELEM  =  QLIT | 'L' Q | ANY(SCOTA) | BREAK(SCOTA) | REM


+------------------------------------------------------------+
| From this we may use back-up-free scanning to define the |



| operand field (F3). B is used to separate fields. The |
| first two fields according to p. 8 of [IBM360b] are ter- |
| minated by blanks (or the end of the line). |
+------------------------------------------------------------+
         F3  =  ARBNO(ELEM FENCE)
         B   =  (SPAN(' ') | RPOS(0)) FENCE
         F1  =  BREAK(' ') | REM
         F2  =  F1


+------------------------------------------------------------+
| To further complicate the issue, if the operation is one |
| of a class of conditional assembly operations defined on |
| p. 75 of [IBM360b] as: |
+------------------------------------------------------------+
         CAOP  =  ('LCL' | 'SET') ANY('ABC') |
+           'AIF' | 'AGO' | 'ACTR' | 'ANOP'
+------------------------------------------------------------+
| then the operand is a conditional assembly operand. For |
| such operands the number of ways of using the quote |
| character in unbalanced situations is increased. For ex- |
| ample T'NAME refers to the type attribute of the symbol |
| NAME and the quote here is not to be considered as one of |
| a pair of balanced quotes. The set of attributes is given |
| by the pattern ATTR. |
+------------------------------------------------------------+
         ATTR  =  ANY('TLSIKN')
+------------------------------------------------------------+
| Moreover, the operations SETB and AIF permit 'logical ex- |
| pressions enclosed in parentheses'. Logical expressions |
| may contain blanks so we must ignore any blanks contained |
| within paired parentheses. Of course we must ignore any |
| parens within quotes and we must continue to ignore quotes |
| which occur merely as part of an attribute. Since it can- |
| not hurt to ignore blanks within parens in any of the con- |
| ditional assembly operations we can treat all of them |
| uniformly. ELEMC is an expanded form of ELEM permitting |
| the additional attributes and the parenthetical groupings. |
| F3C will match an operand field (field 3) if the operation |
| is a conditional assembly. |
+------------------------------------------------------------+
         ELEMC  =  '(' FENCE *F3C ')' | ATTR Q | ELEM
         F3C  =  ARBNO(ELEMC FENCE)
+------------------------------------------------------------+
| Putting it all together: |
+------------------------------------------------------------+
         ASM360  =  F1 . NAME  B
+          ( CAOP . OPERATION  B  F3C . OPERAND |
+            F2 . OPERATION  B  F3 . OPERAND )
+          B  REM . COMMENT
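
The element-by-element decomposition can be followed in a
conventional language. The sketch below, in Python, covers
only the ordinary (non-conditional) case, i.e. the roles of
F1, F2 and F3; the function split_fields is hypothetical and
the set SCOTA is assumed to hold the special characters other
than ampersand. As in ELEM, an L' is recognized only where an
element begins, so an L that follows a letter or an ampersand
is swallowed by the preceding run of non-special characters.

```python
# Special characters other than ampersand (the book's SCOTA); assumed set.
SCOTA = set("+-,=.*()'/ ")


def split_fields(line):
    """Split one assembler line into (name, operation, operand, comment)."""
    n = len(line)

    def field(i):
        # everything up to the next blank or the end of the line
        j = line.find(" ", i)
        j = n if j < 0 else j
        return line[i:j], j

    def skip_blanks(i):
        while i < n and line[i] == " ":
            i += 1
        return i

    name, i = field(0)
    operation, i = field(skip_blanks(i))
    i = start = skip_blanks(i)
    while i < n and line[i] != " ":
        close = line.find("'", i + 1) if line[i] == "'" else -1
        if close >= 0:                      # a balanced quoted literal
            i = close + 1
        elif line.startswith("L'", i):      # L' at the start of an element
            i += 2
        elif line[i] in SCOTA:              # a single special character
            i += 1
        else:                               # a run of non-special characters
            while i < n and line[i] not in SCOTA:
                i += 1
    return name, operation, line[start:i], line[skip_blanks(i):]


print(split_fields("L MVI 3,L'ABC 'THIS IS A COMMENT'"))
print(split_fields("  MVC A,=C'X Y' MOVE"))
```

Since the scan only ever moves forward, it corresponds to the
back-up-free style that FENCE enforces in the pattern.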


Exercise 8.1

Assuming S is nonnull, rewrite BRKREM(S) as a single
expression involving only (but not necessarily all of) LEN,
POS, RPOS, SPAN, BREAK, ANY, NOTANY and ARBNO.


Exercise 8.2

Write a version of SPAN(S) (call it SPANULL) which will match
the null string in the case that SPAN(S) would fail.
Otherwise, SPANULL(S) should behave exactly like SPAN(S). Thus
SPANULL(S) must be monic. This can be done in several ways.
Try it a) using NOT(P), b) using BRKREM(S) and c) from
scratch.


Exercise 8.3

Modify BREAKX (call it BRKXREM) so that it will match the
remainder of the subject string as its last extension. Thus

     'A,B,C'  POS(0) BRKXREM(',') $ OUTPUT FAIL

will print 'A', 'A,B' and 'A,B,C'.


Exercise 8.4

Which of the following assignments would also be valid ways of
implementing BREAKX(S)? That is, which of the statements
below, if substituted for the one statement in Prog. 8.2, will
produce a correct rendition of BREAKX?

     BREAKX  =  ARBNO(BREAK(S) LEN(1)) BREAK(S)

     BREAKX  =  BREAK(S) (NULL | LEN(1) *BREAKX)

     BREAKX  =  ARBNO(LEN(1) BREAK(S)) BREAK(S)

     BREAKX  =  BREAK(S) (NULL | LEN(1) BREAKX(S))


Exercise 8.5

Given the subject, "AB(C,D')E')GH", at which values of
pre-cursor position will the pattern

     BAL('()' , "'") ANY(',)')

match?


Exercise 8.6

Let RULE be string-valued and contain the rule of some SNOBOL4
statement (i.e. the statement without the label and goto
fields). Assume the rule is trimmed of leading and trailing
blanks. Write code to determine the type of SNOBOL4 statement
and branch to one of the following labels: PM for pattern
match, PMR for pattern match with replacement, ASGN for
assignment and EXP for none of the above (Hint: Using the BAL
function, this will require one pattern assignment and three
pattern matches).


Exercise 8.7

The author once committed an error similar to the following.
Assume that to create a truly unusual name the first statement
of FASTBAL (Prog. 8.4) is changed to:

FASTBAL  NAME  =  'FASTBAL ' &STCOUNT

Surely, vanishingly few identifiers contain blanks and the
&STCOUNT makes it that much more unusual. Why is this an
error?


Exercise 8.8

Write a function FIRST(P) which will return a monic pattern
whose post-cursor position is the first post-cursor position
yielded by the pattern P. Note that unlike ONCE(), FIRST(P)
should be reset at each cursor position.


Exercise 8.9

What is *ONCE() equivalent to?


Exercise 8.10

Write a function NTIMES(N) which will return a pattern which
will match the null string exactly N times and thereafter fail
forever.


Exercise 8.11

Write a function IF(P) which will match the null string if P
would succeed and will fail if P would fail. (Hint: you may
use functions defined in this chapter).


Exercise 8.12

Let the SIZE of a string S be L. How many alternates will
LIKE(S) have (Prog. 8.8)? Modify LIKE so that it uses OR
(Note: ANY(&ALPHABET) can be used in place of LEN(1)). How
many principal alternates will LIKE then have (assume that S
contains at least 3 characters and that the first two
characters are different)? What is the fewest number of
principal alternates that LIKE could have? Rewrite LIKE to
obtain that many.


Exercise 8.13

Modify LIKE(S) (Program 8.8) so that, in addition to
insertions, deletions and rearrangements, any string differing
from S in a single character will be matched.


Exercise 8.14

LIKE will tolerate just one error. Rewrite LIKE so that it
will tolerate K errors (Hint: Rewrite LIKE recursively).


Exercise 8.15

What character(s) could not be used as a break character for
OR?


Exercise 8.16

To allow for really rapid scanning for a set of strings,
modify OR(S) so that it returns

     BREAKX(S1) OLD_OR(S)

where OLD_OR is the OR function defined in Prog. 8.9 and where
S1 is derived from the argument S.


Exercise 8.17

Rewrite PLI.STMT so that it does not use FENCE but NOT
instead.


Exercise 8.18

Find a subject for which PLI.STMT will behave incorrectly if
any of the following changes are made.

     (a) removing the FENCE from QLIT
     (b) removing the FENCE from CMNT
     (c) removing the FENCE in the argument to ARBNO.


Exercise 8.19

A telephone information service operates by the user dialing
(or touch-toning) a party's name using the letters that appear
on the dial. This does not uniquely specify a string of
letters since each digit has a group of 3 characters
associated with it as follows:

     ABC - 2          PRS - 7
     DEF - 3          TUV - 8
     GHI - 4          WXY - 9
     JKL - 5          Z   - 0
     MNO - 6

Write a function called NAME which accepts as argument a
string of digits and will return a pattern which can be
matched against all names in a directory. The pattern should
be of the form ANY() ANY() ... ANY() where there are as many
ANY's as there are characters in the string. (Hint: the body
of the function requires only 3 relatively simple statements.)


Exercise 8.20

Assuming that LEN(N) can have negative arguments we could make
a rapid search for the least likely character of a string
using BREAKX. For example, to scan for 'EXAMPLE' in a string
of text, it would in general be more efficient to use the
pattern

     BREAKX('X') LEN(-1) 'EXAMPLE'

than a BREAKX('E') construction because of the low frequency
of the letter 'X' in English text compared with 'E'. Write a
function called SEARCH(S) which will return an optimal pattern
in the above form for searching for the string S. Assume that
S contains only alphabetics and that the letter frequency is
that of English, viz.

     FREQ_TBL  =  'ETOANIRSHDLCWUMFYGPBVKXQJZ'

(Interesting note: The least-frequent character can be
determined in one statement by a simple scan.)


READ
FORTREAD
PARAGRAPH
CONTENTS
SNOREAD
TREEREAD
MFREAD
PUT
FORTPUT
PEEL
SNOPUT

One of SNOBOL4's many assets is the simplicity and directness
of its I/O. One need merely mention the variable INPUT in an
expression and, automatically, a card (or card image) is read
and the string of characters on the card is used as the value
of the variable INPUT. Similarly, the mere assignment of a
value to the variable OUTPUT or PUNCH will cause that value to
be respectively printed or punched.


In many cases, however, we want something slightly richer than 
this, as the following programs will illustrate. 


Program 9.1 - READ

For many applications the basic input process is less than
completely ideal. We often would like to read in a card,
compare it against a pattern, and, if the card was not what we
sought, transfer to another section of the program which will
read the same card from the input stream. Our aim could be
realized if we had the ability to put something back on the
input stream. This act is impossible in SNOBOL4 but it could
be effectively done by writing a subroutine which could store
things we 'pushed' onto the input stream and yield them up
when we sought to read. This we will not do (but leave as an
exercise). We will create something which will be less general
but simpler and, in most situations, easier to use. We will
define a function called READ which will accept one argument,
viz. a pattern, which will be matched against the next string
on the input stream. If the pattern matches this string, the
string will be returned. If the pattern fails to match, the
READ function will fail but will save the string for the next
time READ is called. In the several programs following this
one, we will show how this property can be used.

Another inadequacy with the basic input facility of SNOBOL4
has to do with file sequencing on the IBM 360/370. When no
more input remains on the current input file, and an input
request is made (by a reference to the variable INPUT) the
reference will FAIL (in the SNOBOL4 sense of statement
failure). If an input request is made after the initial
failure, the next file in sequence will be opened. If this
file is not present, the program terminates abnormally.

Unfortunately, this is not what we want most of the time.
Often, the reason several files have been placed in sequence
is to make them appear to the program as one long file, an
appearance which is blemished if failures occur in between.
Also we would like the liberty of making several read requests
after the final failure without fear of blowing the program.

READ will take care of this file sequencing problem. It will
fail only after the last file has been exhausted and
subsequent calls thereafter will merely fail.

+------------------------------------------------------------+
| READ(P) will read in and return a card provided it is mat- |
| ched by the pattern P. If there are no cards remaining or |
| if the pattern fails READ will fail. |
+------------------------------------------------------------+
         DEFINE('READ(P)')                       :(READ_END)
+------------------------------------------------------------+
| Check to see if the number of files beyond the current is |
| negative. If so return failure. |
+------------------------------------------------------------+
READ     LT(NF_INPUT,0)                          :S(FRETURN)
+------------------------------------------------------------+
| Fill the input buffer if it is empty. |
+------------------------------------------------------------+
         IDENT(INPUT_BUF,NULL)                   :F(READ_1)
         INPUT_BUF  =  INPUT                     :F(READ_2)
READ_1
+------------------------------------------------------------+
| Check the buffer for a successful match against P. If no |
| match, then fail return. If match, then return the value |
| in the buffer (INPUT_BUF) and clear the buffer. |
+------------------------------------------------------------+
         INPUT_BUF  P                            :F(FRETURN)
         READ  =  INPUT_BUF
         INPUT_BUF  =  NULL                      :(RETURN)
+------------------------------------------------------------+
| If the attempt to read resulted in failure, then control |
| passes to READ_2. Deduct 1 from the number of remaining |
| files and transfer to label READ. If this number becomes |
| negative, the function will fail continually. |
+------------------------------------------------------------+
READ_2   NF_INPUT  =  NF_INPUT - 1               :(READ)
READ_END


Epilogue 


The variable NF_INPUT (Number of Files on INPUT) is to be set 
equal to the number of files beyond the current one. Normally 
NF_INPUT is equal to 0 since the default value of variables is 
null (which numerically equals 0). Therefore, the programmer 
normally need not worry about its value. However, he may set 
this at any time during the running of the program if ad- 
ditional files remain. For example if a special marker is 
placed at the end of a file to indicate that this was not the 
last one in a sequence then the appearance of that marker 
could be used to trigger an assignment of the value 1 to the 
variable NF_INPUT. 
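
The two properties of READ, the one-card buffer and the
treatment of several files as one, can be sketched together.
The following Python class is a hypothetical analogue, not a
translation: it takes its files as a list of line sequences
instead of counting them in NF_INPUT, uses a regular
expression where READ takes a SNOBOL4 pattern, and returns
None where READ would fail.

```python
import re


class Reader:
    """Conditional reader with a one-line buffer over several inputs."""

    def __init__(self, files):
        self._files = [iter(f) for f in files]   # inputs read in sequence
        self._buf = None                         # the saved, unread line

    def _next_line(self):
        # Move silently from an exhausted input to the next one.
        while self._files:
            line = next(self._files[0], None)
            if line is not None:
                return line
            self._files.pop(0)
        return None

    def read(self, pattern=""):
        """Return the next line if it matches pattern, else None.
        A line that does not match is kept for the next call."""
        if self._buf is None:
            self._buf = self._next_line()
            if self._buf is None:
                return None                      # all inputs exhausted
        if re.search(pattern, self._buf) is None:
            return None
        line, self._buf = self._buf, None
        return line


r = Reader([["C COMMENT", "      X = 1"], ["      END"]])
print(r.read(r"^C"))    # a comment card is taken
print(r.read(r"^C"))    # the next card is not a comment, so it is kept
print(r.read())         # the empty pattern always matches
```

As with READ, calls made after the last input is exhausted
simply keep failing.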


Program 9.2 - FORTREAD

Many string-processing problems involve the analysis of the
source language of some other program. FORTRAN is perhaps
typical of the kind of language which we might wish to
process. Examples include compilation (translation of
FORTRAN programs for semantic errors not discoverable by the
compiler), flow charting (describing diagrammatically the flow
of control), preprocessing (translation of an extension of
FORTRAN into FORTRAN such as SIMSCRIPT [Dimsdale & Markowitz,
1964]), and conversion (translating a version of FORTRAN for
one machine to a version suitable for another). In addition
to these fairly complex undertakings, the processing could be
some simple house-keeping chore such as converting every
reference to 'ALPHA' to a reference to 'BETA'.

When writing programs to analyze other programs it is usually
wise to write a function whose only duty is to collect and
return the next statement on the input stream and FAIL if no
statement remains. The benefits of doing this are the same as
those derived from subroutinizing one's program generally. It
saves duplication of code, allows subdivision of labor, makes
the program logic easier to follow and makes the program
easier to modify and maintain.

A card with a 'C' in column 1 is regarded as a comment card by
the FORTRAN compiler. Comments may appear anywhere, even
between a statement and its continuation. These are ignored. A
continuation card is indicated by a nonblank in column 6. A
blank in column 6 indicates the start of a new statement.
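
The card conventions just listed are enough to sketch a
statement collector. The Python generator below is a
hypothetical analogue of such a function: it takes the cards
as a list of strings, keeps only columns 1 through 72, and
yields one string per statement. It is written directly
against the rules above and does not use a conditional read.

```python
def fort_statements(cards):
    """Yield complete FORTRAN statements from a sequence of card images."""
    stmt = None
    for card in cards:
        card = card[:72]                       # ignore columns 73 through 80
        if card.startswith("C"):               # comment card: skip it
            continue
        is_continuation = len(card) > 5 and card[5] != " "
        if stmt is not None and is_continuation:
            stmt += card[6:]                   # tack on the rest of the card
            continue
        if stmt is not None:
            yield stmt                         # previous statement is complete
        stmt = card
    if stmt is not None:
        yield stmt


deck = [
    "C  COMMENT",
    "      X = A +",
    "C  COMMENT BETWEEN A STATEMENT AND ITS CONTINUATION",
    "     1    B",
    "      END",
]
for statement in fort_statements(deck):
    print(repr(statement))
```

Only the text after column 6 of a continuation card is joined
to the statement.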


+------------------------------------------------------------+
| FORTREAD will read in and return the next FORTRAN state- |
| ment on the input stream. |
+------------------------------------------------------------+
         DEFINE('FORTREAD()T')
         INPUT(.INPUT,5,72)
         FORT_COMMENT  =  POS(0) 'C'
         FORT_CONTINUE  =  POS(0) LEN(5) NOTANY(' ') REM . T
+                                                :(FORTREAD_END)
+------------------------------------------------------------+
| First pass over any initial comment cards and then read in |
| the first statement. |
+------------------------------------------------------------+
FORTREAD READ(FORT_COMMENT)                      :S(FORTREAD)
         FORTREAD  =  READ()                     :F(FRETURN)
+------------------------------------------------------------+
| Then pass over more comments (if any) and then look for a |
| continue card. If not found we return. But if found, the |
| variable T will hold the desired value. This is tacked |
| onto FORTREAD and we renew the search for a continue. |
+------------------------------------------------------------+
FORTREAD_1  READ(FORT_COMMENT)                   :S(FORTREAD_1)
         READ(FORT_CONTINUE)                     :F(RETURN)
         FORTREAD  =  FORTREAD T                 :(FORTREAD_1)
FORTREAD_END

Names referenced    Name    Type        Where defined
by FORTREAD:        READ    Function    Program 9.1

The initialization section of FORTREAD reassociates the
variable INPUT with the first 72 characters of a card. In this
way the identification field of the FORTRAN deck (columns 73
through 80) is ignored.


Two patterns are also set in this initialization section. The 
first pattern matches successfully any FORTRAN comment card; 
the second will not only match successfully a FORTRAN continue 
but will assign the 'meat' of any continue card to the tem- 
porary variable T. 


One may note the rather heavy use to which READ has been put.
It is called at four separate places and has greatly
simplified the writing of FORTREAD. The first call represents
a rather conventional use of READ: "Give me the next card if
it is a comment." It is in fact thrown away immediately. The
second call of READ, which is made with no argument, makes use
of the fact that a null string will be supplied by default.
Since a null string as a pattern will always match, READ() is,
in effect, an unconditional grab at the next string on the
input stream. It can only fail if there is nothing left.


Another use of READ is in the fourth call in the third last 
line of the program. This call not only tests the next string 
but causes a variable (T) to be assigned a subpart of the 
string. Patterns, in general, can denote arbitrarily complex 
computations with the subject string as effective argument. 
This property of patterns imparts to READ a high degree of 
flexibility. 
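
The card-gathering logic just described can also be sketched outside SNOBOL4. The Python fragment below is only an illustration and is not the book's FORTREAD. The names is_comment, continuation_meat and fort_statements are hypothetical, and the card conventions it assumes (a C or * in column 1 marks a comment; blanks in columns 1 through 5 followed by a character other than blank or 0 in column 6 marks a continue card) merely stand in for the patterns FORT_COMMENT and FORT_CONTINUE.

```python
from typing import Iterator, List, Optional


def is_comment(card: str) -> bool:
    # Assumption: a comment card carries 'C' or '*' in column 1.
    return card[:1] in ("C", "*")


def continuation_meat(card: str) -> Optional[str]:
    # Assumption: columns 1-5 blank and column 6 neither blank nor '0'
    # marks a continue card; its "meat" is everything from column 7 on.
    card = card.ljust(6)
    if card[:5] == "     " and card[5] not in " 0":
        return card[6:]
    return None


def fort_statements(cards: List[str]) -> Iterator[str]:
    """Yield one logical FORTRAN statement at a time."""
    statement: Optional[str] = None
    for raw in cards:
        card = raw[:72]                  # ignore the identification field
        if is_comment(card):
            continue                     # comments are thrown away
        meat = continuation_meat(card)
        if meat is not None and statement is not None:
            statement += meat            # tack the meat onto the statement
            continue
        if statement is not None:
            yield statement
        statement = card
    if statement is not None:
        yield statement
```

Called on a list of card images, fort_statements produces each statement with its continuation cards already joined, which is the service FORTREAD provides one call at a time.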


+-----------+
| Program   |  For many of the same reasons that we might
| 9.3       |  want a FORTRAN statement grabber if we
| PARAGRAPH |  were processing FORTRAN decks, we might
+-----------+  want a paragraph grabber if we are proces-
sing text. A paragraph, here, is assumed to be a sequence of 
lines down to the next paragraph whose start is designated by 
a blank in column 1. Since the information on the cards is 
assumed to be sentences, we will place a blank between lines 
(after trimming). Moreover, if a line ends in a period, we 
will place an extra blank between it and the succeeding line, 
since it is conventional, in typing, to separate sentences 
with two blanks. If no paragraphs remain, or if the first line 
to be read does not match the pattern passed to PARAGRAPH as 
argument, then PARAGRAPH will fail. 


+------------------------------------------------------------+
| PARAGRAPH(p) will read in a paragraph provided the first   |
| card on input matches the pattern p. The paragraph is as-  |
| sumed to continue until a blank appears in column 1. It    |
| will fail if a paragraph is not found.                     |
+------------------------------------------------------------+
              DEFINE('PARAGRAPH(FIRST_LINE)T,P')
              PARA_CONTINUE = POS(0) NOTANY(' ')
                                                  :(PARAGRAPH_END)


+------------------------------------------------------------+
| Read in the first line, provided it is the first line of a |
| paragraph. If it is not, fail.                             |
+------------------------------------------------------------+
PARAGRAPH     P = TRIM(READ(FIRST_LINE))          :F(FRETURN)


+------------------------------------------------------------+
| Set the variable T equal to 2 blanks or 1 blank depending  |
| on whether or not the paragraph accumulated so far (in P)  |
| ends with a period.                                        |
+------------------------------------------------------------+
PARAGRAPH_1   T = ' '
              P POS(0) RTAB(1) '.'                :F(PARAGRAPH_2)
              T = '  '
PARAGRAPH_2


+------------------------------------------------------------+
| Now join the next input line provided it is still part of  |
| the paragraph. If so, recycle; otherwise return what is    |
| in P. Note that the blanks in T are not joined to P unless |
| the READ() is successful.                                  |
+------------------------------------------------------------+
              P = P T TRIM(READ(PARA_CONTINUE))   :S(PARAGRAPH_1)
              PARAGRAPH = P                       :(RETURN)
PARAGRAPH_END

Names referenced    Name      Type        Where defined
by PARAGRAPH:       READ      Function    Program 9.1

Epilogue


PARAGRAPH, like FORTREAD, refers to the READ function to do 
its basic input. The pattern that determines the start of a 
new paragraph (or, more exactly, the end of the current 
paragraph) is contained in PARA_CONTINUE. This pattern 
can be modified for slightly different paragraph conventions 
or can be set as an argument. 


Note that the temporary variable P was used to accumulate the 
material in the paragraph. The variable PARAGRAPH could have 
been used and this would have saved one assignment statement. 
P was used for brevity and convenience and with the knowledge 
that straight assignments of the kind indicated are quite fast 
and their effects on the running time of the overall program 
are negligible. 
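
The joining rule (one blank between lines, two after a period) is easy to check in isolation. The following Python sketch is an analogue of PARAGRAPH, not a transcription of it. The function name paragraph is hypothetical, the input is assumed to be a list of lines from which the paragraph is removed, and, for simplicity, the first-line pattern argument is replaced by the fixed test "begins with a blank".

```python
from typing import List, Optional


def paragraph(lines: List[str]) -> Optional[str]:
    """Remove one paragraph from the front of `lines` and return it.

    Returns None (the analogue of failure) if no paragraph starts here.
    """
    # Assumption: a paragraph starts with a line having a blank in column 1.
    if not lines or not lines[0].startswith(" "):
        return None
    text = lines.pop(0).rstrip()
    # A continuation line is nonempty and does not begin with a blank,
    # mirroring POS(0) NOTANY(' ').
    while lines and lines[0][:1] not in ("", " "):
        separator = "  " if text.endswith(".") else " "
        text = text + separator + lines.pop(0).rstrip()
    return text
```

As in the SNOBOL4 version, the separator is chosen before the next line is examined but is only joined when that line really belongs to the paragraph.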


+-----------+
| Program   |  For many of the same reasons that we would
| 9.4       |  want statement-gathering activities to be
| SNOREAD   |  focused in one function in FORTRAN statement
+-----------+  processing, we would want to do the same if
we were processing SNOBOL4. A complexity introduced in ob- 
taining SNOBOL4 statements is the possibility of multiple 
statements per line (separated by semicolons). Moreover, the 
fact that quoted literals may have semicolons embedded within 
them means that a blind search for a semicolon will not do. A 
further complexity is introduced by the fact that labels may 
have quotes embedded within them (only semicolons and blanks 
may not appear in labels) so that such quotes are to be 
ignored when ignoring semicolons within quotes. But we have 
encountered such problems in the preceding chapter and, by now, 
they should be routine. 


Like FORTREAD, SNOREAD will ignore comment cards and fail when 
no more statements remain. 


+------------------------------------------------------------+
| SNOREAD will read in and return the next SNOBOL4 state-    |
| ment. If no statements remain it will fail.                |
+------------------------------------------------------------+
              DEFINE('SNOREAD()S,LBL')
+------------------------------------------------------------+
| Initialization section: Establish I/O and initialize       |
| patterns.                                                  |
+------------------------------------------------------------+
              INPUT(.INPUT,5,72)
              ALPHA = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
              NUM = '0123456789'
              CONTINUE.S = POS(0) ANY('+.') REM . S
              SNO_STMTS = POS(0) ANY(ALPHA NUM ' ')
              SNO_STMT = (POS(0) BREAK(' ;')
+                FASTBAL(,'"' "'",';') ';') . SNOREAD
                                                  :(SNOREAD_END)


+------------------------------------------------------------+
| Examine a buffer (SNO_BUFFER) which presumably has charac- |
| ters in it left over from the last read. If a statement    |
| can be pulled out, fine, just return.                      |
+------------------------------------------------------------+
SNOREAD       SNO_BUFFER SNO_STMT =               :S(RETURN)
+------------------------------------------------------------+
| Otherwise check the buffer for null. If nonnull, then      |
| there is a syntactic error in the input.                   |
+------------------------------------------------------------+
              IDENT(SNO_BUFFER)                   :F(ERROR)


+------------------------------------------------------------+
| We now try to fill the buffer. We first make an attempt    |
| to read the first card of a sequence of SNOBOL4 state-     |
| ments. If this fails, we assume it's a comment or list     |
| control card; in either case we throw the card away and    |
| try again until we succeed in getting a statement or hit   |
| an end of file.                                            |
+------------------------------------------------------------+
SNOREAD_1     SNO_BUFFER = TRIM(READ(SNO_STMTS))  :S(SNOREAD_2)
              READ()                              :F(FRETURN)S(SNOREAD_1)
+------------------------------------------------------------+
| Scoop up all succeeding continue cards and place a         |
| semicolon behind the last card. Then go back to the start  |
| of SNOREAD.                                                |
+------------------------------------------------------------+
SNOREAD_2     SNO_BUFFER = SNO_BUFFER ' ' ?READ(CONTINUE.S)
+                TRIM(S)                          :S(SNOREAD_2)
              SNO_BUFFER = SNO_BUFFER ';'         :(SNOREAD)
SNOREAD_END

Names referenced    Name        Type        Where defined
by SNOREAD:         READ        Function    Program 9.1
                    FASTBAL *   Function    Program 8.4

* indicates name is referenced in the initialization section.


+-----------+
| Program   |  A tree, in the context we will be using it,
| 9.5       |  will be a collection of data in a hierar-
| TREEREAD  |  chical organization. An example of a tree
+-----------+  is shown in Figure 9.1.

                    +---+
                    | A |
                    +---+
                      |
        +-------------+-------------+
        |             |             |
      +---+         +---+         +---+
      | B |         | C |         | F |
      +---+         +---+         +---+
                      |             |
                  +---+---+         |
                  |       |         |
                +---+   +---+     +---+
                | D |   | E |     | G |
                +---+   +---+     +---+

                     Figure 9.1

                An example of a tree.


There is a root node at the top (just the reverse of 
biological trees which have their roots at the bottom). The 
root node has 0 or more immediate descendants or sons. Each 
of these, in turn, has 0 or more immediate descendants. 
Moreover, each node has a value associated with it which, for 
the sake of current discussion, we will assume is a string. 


In the example shown in Figure 9.1, the root node has the 
value 'A' and its 3 sons have the values 'B', 'C' and 'F' 
respectively. 


Reading a tree implies both an external form by which the 
programmer specifies his tree, and an internal form by which 
the tree will be represented in the machine. These represent 
two decisions which will have to be made before we can 
progress further. 


In general, the representation of computer data is an issue 
which is perpetually confronted by the computer programmer. 
His choice can significantly influence the runtime and storage 
efficiency of the resulting program, as well as the ease with 
which he can write, debug, modify, and extend his program. In 
a string language such as SNOBOL4 there is a built-in 
prejudice to represent data objects as strings, because of the 
language's rich string handling capability. That is, one 
feels that when it comes time to process the data object, in a 
way or ways not clearly foreseen at the start of the program, 
the necessary tools will probably be there. 


Another strong advantage of using strings to represent data in 
SNOBOL4 is the relative ease with which one can monitor the 
changing forms of the data. There are several semiautomatic 
tracing features available to the SNOBOL4 user (&FTRACE and 
&TRACE) which print out the values of variables if they are 
strings, integers or reals but not otherwise. Under such cir- 
cumstances the advantage of using strings to represent data is 
more than obvious.* But even if these tracing features were 
not especially inclined to favor the string, there is nonethe- 
less a convenience in being able to display an entire data 
object in one fell swoop merely by printing a string. 


Another advantage of using a string to represent the data is 
that (in SNOBOL4 at least) the data within the string will oc- 
cupy contiguous storage locations. This can mean that certain 
kinds of analysis can be made very rapidly by a scan. Many 
machines have built-in mechanisms for quickly scanning con- 
tiguous core storage for particular data items. Such efficient 
machinery can be brought to bear upon a data structure in con- 
tiguous core whereas it could not if the data were associated 
by means, for example, of address links. 


One reason for not representing a tree as a string is that the 
values of the nodes may not be conveniently representable as 
strings. Another reason may be that the operations that an 
application will typically make upon a tree may be rather un- 
natural for a string. We will show in a later chapter how a 
tree may be represented in SNOBOL4 as a linked structure. For 
this chapter, we will consider only string representations. 


There are many ways in which trees may be represented as 
strings internally. To visualize one very exotic way, imagine 
that a tree is elaborately displayed in a printout page with 
lines of, say, asterisks connecting up boxes denoting the 
nodes, etc. Then the sequence of lines of this printable image 
will, when concatenated, denote unambiguously a tree. Such an 
example is a very good one of how not to encode a tree. Not 
only is the encoding inefficient in terms of storage but it 
also would prove to be unwieldy in processing (selecting, 
searching, deleting, adding, etc.). 


* This limitation need not be viewed as a strict one. The 
  discussion surrounding the function FTRACE, Prog. 14.3, 
  describes how the values of data aggregates may be 
  automatically dumped as well. 


One sane way of representing a tree is by a LISP-like 
representation [McCarthy, 1960]. A node is encoded 


(v,s1,s2,...,sn) 


where v is the value of the node, and where each s is the 
representation of a son. For example, the tree in Figure 9.1 
is represented as 


(A,B,(C,D,E),(F,G)) 


Using such a representation, the values of nodes are restricted 
in that they may not contain commas or either of the paren- 
theses (or if they do, three other characters would have to be 
found at the loss of some notational naturalness). Another 
disadvantage is that, in many applications, it is convenient 
to be able to obtain, without an involved computation, the 
number of sons of a given father node. For both these reasons, 
we will use a slightly different method which is a variant of 
polish prefix notation (from Lukasiewicz [195, p. 78]; but see 
Higman [1967, p. 24] for a nice general discussion). We will 
represent a node as 


v,n,s1s2...sn 


where, as before, v is the value of a node, n is the number of 
sons and s represents a son. The tree in Figure 9.1 would be 
represented as: 


A,3,B,,C,2,D,,E,,F,1,G,, 


Here a node without sons is represented as 


v,, 


That is, the null string as well as an explicit 0 can be used 
to denote 0 sons. This blends well with the SNOBOL convention 
of regarding null strings as arithmetically equal to 0. 


The parenthesis-free or polish notation is somewhat more  dif- 
ficult to analyze visually than the parenthesis notation but 
it is significantly easier to manipulate and for that reason 
is a good machine representation. 
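
The modified polish form is simple enough to generate and to take apart mechanically. The following Python sketch is offered only as an illustration of the encoding, not as part of the SNOBOL4 programs. It assumes a tree is held as a pair (value, list of sons); the names to_polish and from_polish are hypothetical.

```python
from typing import List, Tuple

Tree = Tuple[str, list]   # (value, [son, son, ...])


def to_polish(node: Tree, sep: str = ",") -> str:
    """Encode a tree as value, number of sons, then each son in turn."""
    value, sons = node
    count = str(len(sons)) if sons else ""      # null string denotes 0 sons
    return value + sep + count + sep + "".join(to_polish(s, sep) for s in sons)


def from_polish(text: str, sep: str = ",") -> Tree:
    """Decode the string produced by to_polish."""
    fields: List[str] = text.split(sep)

    def parse(i: int):
        value = fields[i]
        n = int(fields[i + 1] or 0)             # null counts as 0
        i += 2
        sons = []
        for _ in range(n):
            son, i = parse(i)
            sons.append(son)
        return (value, sons), i

    tree, _ = parse(0)
    return tree


figure_9_1 = ("A", [("B", []),
                    ("C", [("D", []), ("E", [])]),
                    ("F", [("G", [])])])
print(to_polish(figure_9_1))    # A,3,B,,C,2,D,,E,,F,1,G,,
```

Because every node announces its own number of sons, the decoder never has to search for a closing parenthesis; it simply counts.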


The external representation of the tree would be that form as 
it is keypunched onto cards or typed onto a teletypewriter. 
To be more explicit, we are concerned with an external input 
representation as opposed to an external output  representa- 
tion. There are obvious fundamental distinctions between a 
tree representation which one is willing to type and a tree 
which one would like to see. For the former, we require ease 
of typing and ease of modifying which are not considerations 
of the latter. 


The form of external input representation we will use is 
similar to the form used by COBOL and PL/I to represent struc- 
tures. The root node is said to be on level 1. Its immediate 
descendants are on level 2; the immediate descendants of any 
node are one level number greater than the level number of 
that node. Thus the representation of any node of a tree is 
given as 


k v 
s1 
s2 
... 
sn 


where k is the level number of the node, v is the value of the 
node and each s represents a son (in the same format). For 
example, the representation of the tree shown in Figure 9.1 is 


1 A 
2 B 
2 C 
3 D 
3 E 
2 F 
3 G 


This form of the tree is not difficult to type or to modify. 
It is also not very difficult to read, particularly if the in- 
put processor permits indentation (as ours will) so that the 
tree may be typed: 


1 A 
  2 B 
  2 C 
    3 D 
    3 E 
  2 F 
    3 G 

The actual program to convert trees from the external input 
form into modified polish is given below. 


+------------------------------------------------------------+
| TREEREAD(level) will read a tree beginning at the given    |
| level. It will fail if this level is not found on the      |
| input.                                                     |
+------------------------------------------------------------+
              DEFINE('TREEREAD(LEVEL)SONS,N')
+------------------------------------------------------------+
| TR_BC is the tree break character used to separate items   |
| in the strungout version of the tree.                      |
+------------------------------------------------------------+
              TR_BC = ','
+------------------------------------------------------------+
| The pattern LEVEL.TREEREAD tests the level and extracts    |
| the value placing this value into TREEREAD.                |
+------------------------------------------------------------+
              LEVEL.TREEREAD = POS(0) (SPAN(' ') | NULL) *LEVEL
+                SPAN(' ') REM . TREEREAD
                                                  :(TREEREAD_END)


+------------------------------------------------------------+
| Read in the node at the current LEVEL and assign the value |
| of this node to TREEREAD and tack on the break character.  |
| If the LEVEL argument does not match the input level then  |
| fail.                                                      |
+------------------------------------------------------------+
TREEREAD      READ(LEVEL.TREEREAD)                :F(FRETURN)
              TREEREAD = TRIM(TREEREAD) TR_BC
+------------------------------------------------------------+
| Read in the sons of this node by calling TREEREAD recur-   |
| sively at a level one higher than the current level. The   |
| number of sons is counted in N.                            |
+------------------------------------------------------------+
TREEREAD_1    SONS = SONS TREEREAD(LEVEL + 1)
+                                                 :F(TREEREAD_2)
              N = N + 1                           :(TREEREAD_1)
+------------------------------------------------------------+
| Concatenate the value of the father, the number of sons    |
| and the representation of the sons.                        |
+------------------------------------------------------------+
TREEREAD_2    TREEREAD = TREEREAD N TR_BC SONS
                                                  :(RETURN)
TREEREAD_END

Names referenced    Name      Type        Where defined
by TREEREAD:        READ      Function    Program 9.1

Epilogue


The first executed statement on entry to TREEREAD calls the 
by-now familiar READ, requesting that a card be read only if 
it is of the level requested. TREEREAD will then call itself 
recursively to obtain trees at levels one deeper. When recur- 
sion is called for, the savings in program length can be 
dramatic and the subjective effects exhilarating. There are 
types of environments in which recursion seems quite well 
suited. One of these environments is when the data structure 
is organized recursively such as the trees in this example. 


The break character is set in the initialization section to be 
a comma. This can be changed at any time by assigning a new 
break character to the variable TR_BC. 
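
For readers who wish to experiment with the recursion away from a card reader, here is a rough Python analogue of TREEREAD. It is a sketch under stated assumptions and not the book's program: the name tree_read is hypothetical, the input is a list of lines that the function consumes from the front, and a regular expression stands in for the pattern LEVEL.TREEREAD.

```python
import re
from typing import List, Optional


def tree_read(lines: List[str], level: int = 1, sep: str = ",") -> Optional[str]:
    """Consume one subtree rooted at `level` and return its polish form.

    Returns None, leaving the line unread, if the next line is not at
    the requested level (the analogue of FRETURN).
    """
    if not lines:
        return None
    # Optional indentation, the level number, blanks, then the value.
    m = re.fullmatch(r" *(\d+) +(.*)", lines[0])
    if m is None or int(m.group(1)) != level:
        return None
    lines.pop(0)
    value = m.group(2).rstrip()
    sons, n = "", 0
    while True:
        son = tree_read(lines, level + 1, sep)   # recursive call, one level deeper
        if son is None:
            break
        sons += son
        n += 1
    # A node without sons gets a null count, as in the text.
    return value + sep + (str(n) if n else "") + sep + sons
```

Given the seven lines of the indented example (1 A, 2 B, 2 C, 3 D, 3 E, 2 F, 3 G), tree_read returns the string A,3,B,,C,2,D,,E,,F,1,G,, and leaves the list empty.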
+-----------+
| Program   |  The READ function (Program 9.1) is flexible
| 9.6       |  to the extent that input can be obtained,
| MFREAD    |  not merely from the standard card reader,
+-----------+  but from any file associated with the
variable INPUT. That is, we could reassociate the variable 
INPUT in order to obtain the input from a source other than 
the standard input. An example of a reassociation of INPUT 
was given in the FORTREAD and SNOREAD functions (Programs 9.2 
and 9.4); there, INPUT was reassociated not with a nonstandard 
file (although it could have been) but with a file whose 
record length was nonstandard (i.e., 72 rather than 80). 


It may be, however, that it is desired to read from two or 
more files simultaneously and then, the original READ would 
not do. Even if the user would be willing to reassociate the 
variable INPUT on each shift of the input stream, the scheme 
would not work because the saved string in INPUT_BUF would 
become hopelessly mixed between the various streams. 


But it is possible to generalize READ to handle multiple 
streams. Our extended version will allow a second argument to 
indicate the source. Thus 


READ(P, .SYSUT1) 


will read from the source associated with the variable SYSUT1. 
Also, a null second argument will imply the stream associated 
with INPUT. Thus, READ(P) will be equivalent to 


READ(P, .INPUT) 


In this way our new READ will be upward-compatible with the 
old READ. 


The new READ, while more general, is less efficient than the 
old READ, and so there are advantages to both. In practice, 
one can do with the efficient READ until such time as it 
becomes necessary to read more than one stream; then one can 
simply 'plug-in' the more general READ. 
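
The essential idea, one look-ahead buffer per stream so that a rejected line is never lost or mixed with another stream, can be sketched in a few lines of Python. This is an illustration only and not the MFREAD listing. The class name MultiReader and its methods are hypothetical, regular expressions stand in for SNOBOL4 patterns, and None plays the part of failure.

```python
import re
from typing import Dict, Iterable, Iterator, Optional


class MultiReader:
    """Tentative reads from several streams, one saved line per stream."""

    def __init__(self) -> None:
        self._streams: Dict[str, Iterator[str]] = {}
        self._buffers: Dict[str, Optional[str]] = {}

    def attach(self, unit: str, lines: Iterable[str]) -> None:
        # Associate a source with a unit name and start with an empty buffer.
        self._streams[unit] = iter(lines)
        self._buffers[unit] = None

    def read(self, pattern: str = "", unit: str = "INPUT") -> Optional[str]:
        # Fill this unit's buffer if it is empty.
        if self._buffers[unit] is None:
            self._buffers[unit] = next(self._streams[unit], None)
            if self._buffers[unit] is None:
                return None                  # nothing left on this unit
        line = self._buffers[unit]
        if re.search(pattern, line) is None:
            return None                      # line stays buffered for next time
        self._buffers[unit] = None           # line is consumed
        return line
```

The default unit name 'INPUT' keeps a one-argument call equivalent to a read from the standard stream, and the empty default pattern always matches, just as the null string does in SNOBOL4.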


+------------------------------------------------------------+
| MFREAD(P,U,L) will behave like READ(P) except that an op-  |
| tional second argument (U) can be used to specify a unit   |
| other than the normal reader. An optional 3rd argument     |
| can specify a logical record length other than 80 (for the |
| first call associated with a given unit).                  |
+------------------------------------------------------------+
              DEFINE('MFREAD(P,U,L)BUF,NF,NM,DATA')


+------------------------------------------------------------+
| Establish structure to hold data on each file.             |
+------------------------------------------------------------+
              DATA('RDATA(RNM,RBUF,RNF)')
+------------------------------------------------------------+
| Establish table to hold structures. Establish default      |
| file.                                                      |
+------------------------------------------------------------+
              READ_TBL = TABLE()
              READ_TBL<> = RDATA(.INPUT)
+------------------------------------------------------------+
| Seize control on calls to the REWIND function. Do a real   |
| rewind but also discard any file information for unit N.   |
+------------------------------------------------------------+
              OPSYN('REWIND.','REWIND')
              DEFINE('REWIND(N)')                 :(MFREAD_END)
REWIND        READ_TBL<N> =
              REWIND.(N)                          :(RETURN)


+------------------------------------------------------------+
| Entry point: Obtain DATA associated with unit U. If DATA   |
| is null establish an entry for this unit and input-        |
| associate some contrived name.                             |
+------------------------------------------------------------+
MFREAD        DATA = READ_TBL<U>
              IDENT(DATA,NULL)                    :F(MFREAD_1)
              NM = 'READ:' U
              DATA = RDATA(NM)
              READ_TBL<U> = DATA
              INPUT(NM,U,L)
+------------------------------------------------------------+
| Arrival here means that DATA contains the data associated  |
| with our i/o unit. Extract information. If NF is less      |
| than 0 fail immediately.                                   |
+------------------------------------------------------------+
MFREAD_1      NM = RNM(DATA)
              BUF = RBUF(DATA)
              NF = RNF(DATA)
              LT(NF,0)                            :S(FRETURN)


+------------------------------------------------------------+
| If BUF is null, fill it. Then test it against P. If fail,  |
| FRETURN. Otherwise return BUF.                             |
+------------------------------------------------------------+
              IDENT(BUF,NULL)                     :F(MFREAD_2)
              BUF = $NM                           :F(MFREAD_3)
              RBUF(DATA) = BUF
MFREAD_2      BUF P                               :F(FRETURN)
              MFREAD = BUF
              RBUF(DATA) =                        :(RETURN)
+------------------------------------------------------------+
| Decrement NF and try again.                                |
+------------------------------------------------------------+
MFREAD_3      RNF(DATA) = NF - 1                  :(MFREAD_1)
MFREAD_END


Epilogue


The extended version of READ is patterned after the single- 
file READ. There are several additional statements in the 
initializing section which set up the names of variables which 
are to be indirectly referenced. Beyond the label READ_3, 
things are pretty much the same as the simpler READ with in- 
direct referencing replacing the direct referencing. That is, 
instead of referring for example to the variable INPUT_BUF a 
reference to the variable $B is made where B has been assigned 
an appropriate name. 


The first statement executed (after the entry point) assigns 
the name 'INPUT' to the variable F provided F is null. This 
is a common way of assigning default values to dummy 
parameters in functions. 


The reader may be somewhat alarmed as to the amount of over- 
head associated with each read request. This overhead, 
however, may be quite tolerable in a programming situation 
which involves relatively few reads compared with other com- 
putations or in a situation in which programming the problem 
costs more than running it. If the overhead proves excessive, 
the reader will find an outline for a faster Multifile READ in 
Exercise 9.6. 


OUTPUT ROUTINES 


As was mentioned in the introductory remarks of this chapter, 
output in SNOBOL4 is almost magically simple. Assigning a 
string to the variable OUTPUT or PUNCH will print or punch 
the string respectively. Moreover, it does not have the 
problems that input has; i.e., transmission is not typically 
tentative depending on the value of the string and output 
files are not sequenced like input files may be. But there 
are problems nonetheless. For one thing, printed output must 
appeal to the human eye which means vertical as well as 
horizontal alignment and this generally is difficult to do 
when simply outputting strings. For the same reason, 
overstriking, which calls for a perpendicular alignment, is 
equally awkward and unnatural. Both of these obstacles 
are overcome quite easily with the use of the block datatype, 
a discussion of which is deferred until a later chapter. 


For this chapter we will consider only basic card output; 
i.e., output which is meant to be read by some other computer 
program. 


+-----------+
| Program   |  Just as it is good practice to focus input
| 9.7       |  activities into a single function, so it is
| PUT       |  a good idea to do the same for output. PUT
+-----------+  is a function which will accept as argument
a string (of no more than 72 characters) and print this 
card labeled and numbered in the identification field (columns 
73 through 80). It will also punch what is printed. 


Labelling is effected by the user of PUT by assigning a string 
to the variable PUT_LABEL. Thus 


PUT_LABEL = 'PUT' 


will set this label to equal the indicated 3 letters. 
Numbering of cards is by increments of 1. Sometimes it is 
desired to increment by a number other than 1; this is accom- 
plished by setting the value of PUT_INC. Thus 


PUT_INC = 10 


will set the increment to 10. 


+------------------------------------------------------------+
| PUT(L) will output L (presumed to be a card image). It     |
| will label the OUTPUTted card starting in column 73. The   |
| user may specify the label by assigning a string to the    |
| variable PUT_LABEL. The cards will be numbered in incre-   |
| ments of 1; the increment can be changed by assigning an   |
| appropriate value to PUT_INC.                              |
+------------------------------------------------------------+
              DEFINE('PUT(L)')
              PUT_INC = 1
                                                  :(PUT_END)
PUT           PUT_N = PUT_N + PUT_INC
              OUTPUT = RPAD(L,72) PUT_LABEL
+                LPAD(PUT_N,8 - SIZE(PUT_LABEL))
              PUNCH = OUTPUT                      :(RETURN)
PUT_END

Names referenced    Name      Type        Where defined
by PUT:             LPAD      Function    Program 3.2
                    RPAD      Function    Program 3.3

Epilogue


Note that when OUTPUT is used on the right hand side of the 
assignment (last executable statement) the value last output 
is used as value and no OUTPUTing of information is implied or 
inferred. 


For debugging purposes, it is perhaps prudent to turn punching 
off. This can be done either by removing the assignment to 
PUNCH or by executing the statement: 

DETACH(.PUNCH) 


The latter is preferred since when it comes time to actually 
punch, it will be obvious what to do. 
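
The layout of the identification field can be verified with a small Python analogue of PUT. It is a sketch only: the class name CardPunch and its attributes are hypothetical, the cards are collected in a list rather than printed or punched, and blank padding of the sequence number is assumed.

```python
class CardPunch:
    """Builds 80-column card images with a label and sequence number."""

    def __init__(self, label: str = "", increment: int = 1) -> None:
        self.label = label            # plays the role of PUT_LABEL
        self.increment = increment    # plays the role of PUT_INC
        self.number = 0               # plays the role of PUT_N
        self.cards = []

    def put(self, line: str) -> str:
        self.number += self.increment
        width = 8 - len(self.label)   # room left in columns 73 through 80
        card = line.ljust(72) + self.label + str(self.number).rjust(width)
        self.cards.append(card)
        return card


punch = CardPunch(label="PUT", increment=10)
first = punch.put("      X = 1")
print(first[72:])    # PUT   10
```

The label occupies the leading columns of the field and the number is right-justified in whatever room remains, so a longer label leaves fewer digits for numbering.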


+-----------+
| Program   |  In the description of FORTREAD (Program 9.2)
| 9.8       |  several examples of FORTRAN source proces-
| FORTPUT   |  sing were given. In three of these examples
+-----------+  (preprocessing, conversion and housekeeping)
the output is also FORTRAN and, in such cases, the programming 
situation can be simplified by writing an output function spe- 
cially designed for FORTRAN statements. 
+------------------------------------------------------------+
| FORTPUT(S) will output a FORTRAN statement S. The card     |
| will also be punched, labeled, numbered, and continued if  |
| necessary.                                                 |
+------------------------------------------------------------+
              DEFINE('FORTPUT(S)T')               :(FORTPUT_END)
+------------------------------------------------------------+
| Entry point: Remove initial chunk from S; output it; check |
| for completion, if so return.                              |
+------------------------------------------------------------+
FORTPUT       S (LEN(72) | REM) . T =
              PUT(T)
              IDENT(S,NULL)                       :S(RETURN)
+------------------------------------------------------------+
| Since something is left in S we must supply a continuation |
| card. The location field of this continuation card (the    |
| first 5 characters) must be blank.                         |
+------------------------------------------------------------+
              S = DUPL(' ',5) '1' S               :(FORTPUT)
FORTPUT_END

Names referenced    Name      Type        Where defined
by FORTPUT:         PUT       Function    Program 9.7
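
The chunking performed by FORTPUT may be easier to follow in a form that returns the cards instead of printing them. The Python function below is a hypothetical analogue (the name fort_cards is ours, and labeling and numbering are left out); it assumes, as the listing does, that a 1 in column 6 marks a continuation card.

```python
from typing import List


def fort_cards(statement: str) -> List[str]:
    """Split a FORTRAN statement into card images of at most 72 columns."""
    cards = []
    while True:
        cards.append(statement[:72])         # remove and keep the initial chunk
        statement = statement[72:]
        if not statement:
            return cards                     # nothing left: we are done
        # Something is left: columns 1-5 blank, continuation mark in column 6.
        statement = " " * 5 + "1" + statement
```

Each continuation card therefore carries 66 characters of the statement, since 6 of its 72 columns are taken up by the location field and the continuation mark.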
+-----------+
| Program   |  SNOBOL4 statement outputting (which we do
| 9.9       |  next in Program 9.10) is more complex than
| PEEL      |  FORTRAN outputting, owing to the fact
+-----------+  that a SNOBOL4 statement cannot be split ar-
bitrarily but only at a point where a blank may appear (but 
not within quoted literals). The determination of a suitable 
break point in a SNOBOL4 statement will be done by the func- 
tion PEEL. This function is being isolated because it can be 
used for other purposes such as compressing and reformatting 
SNOBOL4 statements. Also, a slightly modified version of PEEL 
can be used for finding break points in JCL (Exercise 9.8). 


PEEL(name, n) will peel off and return a prefix from the named 
string. The prefix is to be as large as possible but not 
longer than n characters. The named string will be modified. 
The prefix will be broken off from the named string only at a 
suitable break point defined as follows. The break may never 
appear within quotes. Given this first condition, it may occur 
before any of the characters in BEFORE or after any of the 
characters in AFTER. If no prefix can be found other than the 
null string then PEEL will fail. 


PEEL has a side effect. In addition to returning a value, it 
will modify a part of the outside world. In particular, it 
will remove a prefix from the string named by the first argu- 
ment. The modification of supplied arguments can only be 
accomplished in SNOBOL4 by passing as argument the name of the 
variable. Thus to remove a prefix from the string S the call 
to PEEL must be of the form 


PEEL(.S,n) 


(the call PEEL('S',n), although equivalent, is not recommended 
because it does not provide as good documentation and in some 
implementations is less efficient). This method of denoting 
arguments is a bit unusual inasmuch as the arithmetic 
languages, FORTRAN, PL/I and ALGOL permit functions to modify 
argument variables without the encumbrance of an initial 
period. At first, the initial period appears to be something 
of a nuisance. As it turns out, however, it has the important 
advantage of alerting the reader to the possibility of side 
effects. 


+------------------------------------------------------------+
| PEEL(NAME,N) will peel off and return a prefix from the    |
| named string. The prefix is to be as large as possible     |
| but not longer than N characters. The named string will    |
| be modified. The prefix will be broken off from the named  |
| string only at a suitable break point. The break may never |
| appear within quotes. It may occur before any of the       |
| characters in BEFORE or after any of the characters in     |
| AFTER. If no prefix can be found other than the null       |
| string then PEEL will fail.                                |
+------------------------------------------------------------+
              DEFINE('PEEL(NAME.,N.)K1.,K2.')
              BEFORE = ') ,>'
              AFTER = '( ,<'
              PEEL.K2. = POS(0) TAB(*K1.) (ANY(AFTER) @K2. |
+                BAL(,'"' "'") (@K2. ANY(BEFORE) | ANY(AFTER) @K2. |
+                RPOS(0) @K2.))
                                                  :(PEEL_END)


+------------------------------------------------------------+
| If the NAME.ed string is no longer than N. characters,     |
| return the value and null out the variable.                |
+------------------------------------------------------------+
PEEL          LE(SIZE($NAME.),N.)                 :F(PEEL_1)
              PEEL = $NAME.
              $NAME. =                            :(RETURN)
+------------------------------------------------------------+
| Otherwise we scan for a break point in the named string.   |
| Our search begins after the K1.th character (K1. is ini-   |
| tially 0) and assigns the numerical value of the break     |
| point to K2. Ultimately K2. exceeds the value of N. at     |
| which point we transfer to PEEL_2.                         |
+------------------------------------------------------------+
PEEL_1        $NAME. PEEL.K2.                     :F(ERROR)
              GT(K2.,N.)                          :S(PEEL_2)
              K1. = K2.                           :(PEEL_1)
+------------------------------------------------------------+
| The breakpoint is now indicated by K1. and provided it is  |
| not zero we can return normally.                           |
+------------------------------------------------------------+
PEEL_2        EQ(K1.,0)                           :S(FRETURN)
              $NAME. LEN(K1.) . PEEL =            :(RETURN)
PEEL_END


Names referenced    Name      Type        Where defined
by PEEL:            BAL *     Function    Program 8.3


* indicates name is referenced in the initialization section.


Epilogue


PEEL is not as fast as it could be. The pattern PEEL.K2. ad- 
vances by 1 character at a time until overflow occurs. The 
inefficiency is normally not troublesome because PEEL will 
normally be able to return the entire string without having to 
search for a break point. Nevertheless, some applications 
might call for a faster PEEL and Exercise 9.9 outlines a 
method for increasing the speed as well as increasing the 
selectivity as to where breaks may occur. 


The names of parameters and temporary variables (viz. NAME., 
N., K1. and K2.) were deliberately made strange so as to 
reduce the chances of duplicating the name passed as first ar- 
gument to PEEL. This issue is discussed fully in the Epilogue 
of the SWAP routine (Program 3.14). 
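
The break-point rule itself (never within quotes, before a character of BEFORE or after a character of AFTER, and as far along as N permits) can be tried out with the following Python sketch. It is an approximation for illustration, not a rendering of the pattern PEEL.K2.: the names break_points and peel are hypothetical, and since a Python function cannot modify a string passed to it, the remainder is returned alongside the prefix instead of being stored back through a name.

```python
from typing import List, Optional, Tuple

BEFORE = ") ,>"
AFTER = "( ,<"


def break_points(text: str) -> List[int]:
    """Cursor positions at which `text` may be split."""
    points = set()
    quote = None                       # the quote character we are inside, if any
    for i, ch in enumerate(text):
        if quote is None:
            if ch in BEFORE and i > 0:
                points.add(i)          # break before this character
            if ch in "'\"":
                quote = ch             # a quoted literal opens
            elif ch in AFTER:
                points.add(i + 1)      # break after this character
        elif ch == quote:
            quote = None               # the quoted literal closes
    return sorted(points)


def peel(text: str, n: int) -> Optional[Tuple[str, str]]:
    """Return (prefix, rest), or None if no nonnull prefix of size <= n exists."""
    if len(text) <= n:
        return text, ""
    fits = [p for p in break_points(text) if 0 < p <= n]
    if not fits:
        return None
    cut = max(fits)                    # as large a prefix as possible
    return text[:cut], text[cut:]


print(peel("OUTPUT = 'A, B' X", 12))   # ('OUTPUT = ', "'A, B' X")
```

In the example the comma and blank inside the literal are passed over, so the split falls just before the opening quote even though later positions would also fit within 12 characters.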


+-----------+
| Program   |  The function to output SNOBOL4 statements is
| 9.10      |  shown in Program 9.10. PEEL has greatly
| SNOPUT    |  simplified its writing.
+-----------+


+------------------------------------------------------------+
| SNOPUT(S) will output a SNOBOL4 statement S. It will han-  |
| dle automatically: labeling, numbering, punching, and, if  |
| necessary, continuation.                                   |
+------------------------------------------------------------+
              DEFINE('SNOPUT(S)')
                                                  :(SNOPUT_END)
+------------------------------------------------------------+
| Output the first 72 characters (breaking appropriately).   |
+------------------------------------------------------------+
SNOPUT        PUT(PEEL(.S,72))                    :F(ERROR)
+------------------------------------------------------------+
| If S is null we are done, otherwise peel off the next 71   |
| characters and prefix with a continuation (+). Continue    |
| to do this until S is null.                                |
+------------------------------------------------------------+
SNOPUT_1      IDENT(S,NULL)                       :S(RETURN)
              PUT('+' PEEL(.S,71))                :F(ERROR)S(SNOPUT_1)
SNOPUT_END

Names referenced    Name      Type        Where defined
by SNOPUT:          PUT       Function    Program 9.7
                    PEEL      Function    Program 9.9
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oe ero es 
| Exercise 9.1 | Extend the basic READ routine so that it can 
A operate like a pushdown stack. thus 


          PUSH('ABC')
          PUSH('XYZ')
          A = READ()
          B = READ('S')
          C = READ('YZ')
          D = READ()


when executed will cause the following values to be assigned. 


          A = 'ABC'
          C = 'XYZ'
          D = the next input card

The PUSH & POP routines (Progs. 5.5 & 5.6) may be used. In
fact, the PUSH above is assumed to be exactly Prog. 5.5.


Exercise 9.2  Modify PARAGRAPH so that the start of the next
paragraph is denoted by a pattern given to PARAGRAPH as
argument. You may use the modified READ given in Ex. 9.1.


Exercise 9.3  Modify FORTREAD so that it returns the FORTRAN
statement with all extraneous blanks removed (i.e., blanks not
in positions 1 through 6, not within quotes, and not within a
Hollerith field (nH...)).


Exercise 9.4  Modify TREEREAD to accept trees whose structure
is denoted by


(a) indentation (allow sons to have any indentation greater 
than their fathers) 


(b) numerical values without the restriction that level num- 
bers increase in steps of 1. 


In each case assume that the value of a node is some nonnull 
quantity. 


Exercise 9.5  Use READ to write a function called ASMREAD
which is to read in statements from IBM's OS/360 assembly
language [IBM360b]. The fact that a given card is to be
continued is denoted by a nonblank in column 72 but this
character is not considered part of the statement. The next
following card (incredibly) must have blanks in columns 1
through 15 and these blanks (but no following blanks) are
ignored when building the statement. ASMREAD should fail if
an inconsistency is encountered in one of the continue
conventions.


Exercise 9.6  Write a multifile READ which avoids most of the
inefficiencies of multifile reading in the following way: When
READ is called, control is directed to the label 'READ_' F
where F is the file name. The statements transferred to can be
compiled at runtime (using the CODE function) at the first use
of file F and can be 'custom-made' for the particular file
name.


Exercise 9.7  Given the tab mechanisms of keypunches and
teletypewriters, it is easier, in typing, to left-justify
elements within fields whereas many applications (especially
numerical) call for right justification of elements within
fields.


(a) Given an 80-character string (card image) in the variable 
S, write a single statement to right justify any left- 
justified element in the field which starts in column
numbered C and whose length is L. You may use LPAD and/or 
RPAD (Progs. 3.2 & 3.3). 


(b) Use (a) as the basis for a program which will right- 
justify elements in a deck of cards. The first input card 
contains a sequence of X's in each field to denote their 
locations. This can be converted to a sequence of number 
pairs and then (a) can be repeated for each number pair
and each card. 


Exercise 9.8  (a) Using READ, write a function (called
JCLREAD) which will extract a complete JCL statement [IBM360c]
from the input stream (let it pass over and output all
non-JCL). Delete unnecessary blanks between a control card and
the following continue. Remove all comments.


(b) Write a function to output JCL. (Hint: PEEL can be 
used.) 


(c) Test the two functions by replacing in a set of JCL
statements every occurrence of 'DSNAME=' by
'DSNAME=LIBRARY.'.


Exercise 9.9  To improve the operating speed of PEEL (Prog.
9.9) one may search over nonbreaks and/or decrease the number
of break points.


(a) Write a pattern which behaves like PEEL.K2. but which 
uses FASTBAL, Prog. 8.4, to rapidly scan over characters 
which are not significant in determining break points 
(viz. BEFORE, AFTER and the quotes). 


(b) If we reduce the break set (say AFTER = '-' and BEFORE =
's') then we will have higher speed and the break points
will be more aesthetically placed. There is the danger,
however, that a nonnull peel cannot be made. Rewrite PEEL
so that if it runs into difficulties with the given
BEFORE and AFTER, it temporarily uses a stronger version
of PEEL.K2. (richer BEFORE and AFTER) to crack the given
statement.


Exercise 9.10  (a) Let the variable NAME. have the value

     'LABEL SUBJECT PATTERN = OBJECT :(LABEL)'

What value is returned by the call

     PEEL('NAME.',35)

(b) Modify PEEL so that if the name given is a forbidden
name, PEEL will go to ERROR.


Exercise 9.11  Using SNOREAD and SNOPUT write a SNOBOL4
program to process other SNOBOL4 programs such that every call
to the function ALPHA is replaced by a call to the function
ALPHANUMERIC.


Exercise 9.12  Using SNOREAD and SNOPUT write a program to
squeeze out extraneous blanks from another SNOBOL4 program. Be
sure to pack as many statements on a line as possible.


The paragraph you are reading now has been formatted by a
computer directed by the very programs we will describe in
this chapter. Paragraph formatting is a special case of the
more general activity known as text formatting. Whereas the
former activity is limited to the shaping of individual
paragraphs, the latter activity is more open-ended and includes
page layout, pagination, etc.


What, the reader may ask, is so complicated about decomposing 
a paragraph into lines that we must spend an entire chapter in 
its discussion? If all that were involved in this process were 
the cutting of lines at convenient blanks and padding with 
blanks to right-justify margins, then we could dispose of the 
subject in about a page of text and 6 lines of code. But the 
task is complicated considerably by the seemingly minor 
details of backspacing, underscoring and hyphenation. Though 
the need for overstriking is relatively rare, it does exist 
and just as much code need be written if we are backspacing 
occasionally as frequently. In fact, it is quite normal that
90% of the execution time of a program is spent in only 10% of
it. This fact and its implications for optimum programming are
not always fully appreciated. All too often, programmers care
only to get the program performing as expected without regard
to efficiency considerations or, at the other extreme, have a
compulsive urge to optimize every bit of it. Both miss the
sound central approach of implementing efficiently that portion
which is used most frequently. In this chapter we will have
ample occasion to employ this principle.


In Program 9.3 we showed how to read in a paragraph and in 
this section we will format it. Between these two activities, 
the paragraph may undergo conversions in what we will refer to 
as the pre-processing stage. If the original input device were 
a keypunch, then almost certainly some kind of upper to lower 
case conversion would be necessary. More generally, if charac- 
ters appear on the printer which are not available on the 
input device, a conversion is necessary to produce those 
characters. Another instance in which conversion is used is 
in the indication of variable information such as figure num- 
bers and exercise numbers. In a sophisticated text processor, 
these will be given in symbolic form to be converted to actual 
numbers when the text is printed. 


We will assume that, possibly as a result of this
pre-processing, the input text may contain the special
characters BSPACE and USCORE. BSPACE, as its name implies,
will permit the user to overstrike print characters. We will
denote this character by backarrow (←) so that 'O←/' will
print as 'Ø'. Just what character the user types to obtain a
BSPACE in his text is determined by the pre-processor. In the
system used to prepare this document, the symbol '-* was used. 
Backspacing complicates such issues as separating a paragraph
into lines and printing a line on a device which does not
directly support the backspace character (such as a printer).
It also serves to cloud the issue of when a line equals
another line.


Overstriking can extend the set of characters which one can 
print. Several examples of interesting overstruck combinations 
are shown in Table 10.1. 


Table 10.1  Characters obtainable via overstriking

     cent sign            division            Theta
     dagger               symbolic blank      Phi
     double dagger        right arrow         Gamma
     not equal            left arrow          Lambda


USCORE is a character which appears in pairs and indicates
that any material between them is to be underscored. In a
sense, underscoring is a special case of backspacing but, in
another sense, it is not. For example, we are permitted to
break lines at blanks and expand lines at blanks for the
purpose of formatting paragraphs. But we would also like to be
able to break the line:

     "A quick brown fox really did jump over..."

after the "really" so that we might print:


A quick brown fox really 
did jump over... 


Note that not only are we breaking at a nonblank, we are
actually discarding a character. If the underscore character
('_') were treated as a break character, then there may be
difficulties with formatting paragraphs which contain '_'. One
example of this is the paragraph you are reading now. Another
example is


"Printing the string 'A Beee ' yields 'A B'." 
In the above case it becomes not merely awkward but actually
impossible to disentangle that which is regarded as
underscoring from that which is overstriking.
The USCORE character is inserted into the text by the
pre-processor and is not actually typed by the user. The way in
which the user will indicate underscoring will depend on the
input device. In the system which formatted this text (and
which is oriented toward keypunch input) the underscore
character ('_') is used to denote that the following word is
to be underscored and a sequence of the form | ... | indicates
underscoring of an arbitrary string of characters. In a system
oriented toward teletype input the sequence


     n-characters  n-backspaces  n-underscores

could be translated, by the pre-processor, into

     USCORE  n-characters  USCORE


Program 10.1: BNORM

Backspace normalization is the process of converting a string
with backspaces embedded in it into a string which prints
identically to the first but in which no 2 backspaces occur
consecutively. Thus 'ABCD←←←←1234' is translated into
'A←1B←2C←3D←4'. This serves to localize the effect of
backspacing, simplifying later processing. It also serves as a
necessary prelude to image normalization as described in
INORM, Program 10.2.


To describe rigorously what is meant by B-normalization, we
define the spacing of a string as equal to the number of
characters in the string minus twice the number of BSPACE's
and minus the number of USCORE's. Thus, the string 'AB←C' has
a spacing of 4 - 2(1) = 2. The string 'AMB←CM' (where M is the
USCORE) has a spacing of 6 - 2(1) - 2 = 2. Informally, the
spacing of a string equals the net movement of the type ball
(or equivalent mechanism) when the string is printed on a
teletypewriter. Note that the spacing can be negative as in
the string '←←A'.
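
The definition of spacing translates directly into a few lines of code. The Python sketch below is an illustration only; it writes BSPACE as '<' and USCORE as '_', which are stand-ins chosen here and not the characters used by the book's system.

```python
BSPACE = "<"   # stand-in for the backspace character (assumption)
USCORE = "_"   # stand-in for the underscore marker (assumption)


def spacing(s):
    """Number of characters, minus twice the BSPACEs, minus the USCOREs."""
    return len(s) - 2 * s.count(BSPACE) - s.count(USCORE)
```

With these stand-ins, `spacing("AB<C")` is 2 and `spacing("<<A")` is -1, in agreement with the examples in the text.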


We define a prefix of a string as any initial sequence of
characters of the string. Thus, 'PR' is a prefix of the string
'PREFIX'. In general, a string of n characters will have n+1
prefixes including the null string and the string itself.
Similarly, a suffix is any terminal sequence of characters.
More formally, P is a prefix of S if there exists a string T
such that

     P T = S

and F is a suffix of S if there exists a string T such that

     T F = S

A string is said to be balanced on the left if the spacing of
each of its prefixes is nonnegative. Informally, if, when
printing the string, we attempt to force the typeball beyond
the left margin of the paper, the string is not balanced on
the left. In a similar way, we define a string to be balanced
on the right if the spacing of each of its suffixes is
nonnegative. Informally, a string is balanced on the right if
its maximum rightward movement is reached at the end of the
string. We call a string balanced if it is balanced on the left
and on the right.
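
The two balance conditions can be tested by brute force over prefixes and suffixes. This Python sketch is illustrative only: '<' and '_' are stand-ins for BSPACE and USCORE, and "balanced on the right" is taken to mean that every suffix has nonnegative spacing, which is how the proofs of Propositions 10.1 and 10.2 use the term.

```python
BSPACE = "<"   # stand-in for the backspace character (assumption)
USCORE = "_"   # stand-in for the underscore marker (assumption)


def spacing(s):
    """Number of characters, minus twice the BSPACEs, minus the USCOREs."""
    return len(s) - 2 * s.count(BSPACE) - s.count(USCORE)


def balanced_left(s):
    """True if every prefix of s has nonnegative spacing."""
    return all(spacing(s[:i]) >= 0 for i in range(len(s) + 1))


def balanced_right(s):
    """True if every suffix of s has nonnegative spacing."""
    return all(spacing(s[i:]) >= 0 for i in range(len(s) + 1))


def balanced(s):
    """True if s is balanced on the left and on the right."""
    return balanced_left(s) and balanced_right(s)
```

For instance, "<ABC" fails the left test at its first character, while "ABC<" passes the left test but fails the right test because its last suffix is a lone backspace.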


Examples of strings unbalanced on the left are '←ABC' and
'AB←←←←'; such strings cannot generally be printed and are
almost certainly errors. Any interpretation short of abnor- 
mally terminating the run will probably be an acceptable one. 
Strings unbalanced on the right such as 'FOB-=/* or 'ABC-' are 
not errors and have well-defined meanings. 


Let a character c which is neither USCORE nor BSPACE be
embedded in the string S as

     S = S₁ c S₂

Then the position number of c is defined as equal to the
spacing of S₁ plus 1. We refer to the characters of S other
than USCORE and BSPACE as the position characters of S.


Let S be a string without USCORE's. Then the B-normalization
of S is defined as that string S' such that

1) S' is balanced

2) The position numbers of the characters of S' are
monotonically nondecreasing.


3) The position characters of S' are identical to the posi- 
tion characters of S and each such character retains its 
position number and, moreover, any pair of characters 
having identical position numbers retain their relative 
ordering in S' as they had in S. 


As an immediate consequence of the definition, all position
numbers in the B-normalization of a string are nonnegative.
Hence, strings unbalanced on the left having negative position
numbers will not have a B-normal form. On the other hand all
strings balanced on the left have a unique B-normalization
which can be produced by construction. This follows because
items 1) and 2) assure us that S' is a sequence of substrings
each representing one print position having the form:

     'c₁←c₂← ... ←cₙ'

where n ≥ 1 and in general varies with the print position. The
characters c₁, c₂, ..., cₙ each have the same position number.
Note that they all must retain their relative ordering. This
is done not merely to make B-normalization unique, but also
because we do not know the intended purpose of the
backspacing. Thus, c₁←c₂ is indistinguishable from c₂←c₁ when
printed but if we choose to interpret '←' as subscript or
superscript the ordering is important.
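
The constructive argument can be made concrete by collecting the characters struck at each print position, in the order they were struck, and then writing the positions out from left to right. The Python sketch below is not the method of Program 10.1; it is a direct reading of the definition for strings without USCORE's, with '<' as an assumed stand-in for BSPACE.

```python
BSPACE = "<"   # stand-in for the backspace character (assumption)


def bnorm_by_columns(s):
    """B-normalize a USCORE-free string by grouping characters per print position."""
    columns = []   # columns[p] lists the characters struck at position p + 1
    pos = 0        # current print position, counted from 0
    for ch in s:
        if ch == BSPACE:
            pos -= 1
            if pos < 0:
                raise ValueError("string is not balanced on the left")
        else:
            if pos == len(columns):
                columns.append([])
            columns[pos].append(ch)   # keeps the original striking order
            pos += 1
    # One group per print position, overstrikes joined by single backspaces.
    return "".join(BSPACE.join(col) for col in columns)
```

Because the type ball moves right only by striking a character, every position up to the rightmost one holds at least one character, so no group is ever empty.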


If S contains USCORE's the situation is complicated slightly.
What are we to make of

     'FOM←/RTRANM'

Should it be

     'FORTRAN'  or  'FORTRAN'

Obviously this is a mistake. The string to the right of 'M'
should be balanced on the left so that the 'M' is not shifted
to the right of characters which appeared after it. Similarly
the string to the left of 'M' should be balanced on the right.
Hence we define the B-normalization S' of the string S where

     S = S₁ M S₂

as

     S' = S₁' M S₂'

where S₁' and S₂' are the B-normalized versions of S₁ and S₂,
respectively. Of course, S₁ and S₂ may either or both contain
USCORE's in which case the definition applies recursively.


Proposition 10.1

If any string S is balanced on the left, then REVERSE(S) is
balanced on the right. Conversely, if S is balanced on the
right, then REVERSE(S) is balanced on the left.

Proof: The proof is simple but instructive. If S is balanced
on the left then all prefixes of S have nonnegative spacing,
by definition. If P is a prefix of S then REVERSE(P) is a
suffix of REVERSE(S). Since the spacing of REVERSE(P) is the
same as the spacing of P, the spacing of the suffix is
nonnegative. Since all suffixes of REVERSE(S) correspond in
this way to some prefix of S, we conclude that REVERSE(S) is
balanced on the right. In a similar way we can prove the
converse.
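
The reversal property can be checked exhaustively on short strings. The Python sketch below is a sanity check and not a proof; it assumes '<' as a stand-in for BSPACE, uses strings without USCORE's, and takes right balance to mean that every suffix has nonnegative spacing.

```python
from itertools import product

BSPACE = "<"   # stand-in for the backspace character (assumption)


def spacing(s):
    """Spacing of a USCORE-free string."""
    return len(s) - 2 * s.count(BSPACE)


def balanced_left(s):
    return all(spacing(s[:i]) >= 0 for i in range(len(s) + 1))


def balanced_right(s):
    return all(spacing(s[i:]) >= 0 for i in range(len(s) + 1))


def reversal_property_holds(max_len=8):
    """Check that S is left-balanced exactly when REVERSE(S) is right-balanced."""
    for n in range(max_len + 1):
        for chars in product("a" + BSPACE, repeat=n):
            s = "".join(chars)
            if balanced_left(s) != balanced_right(s[::-1]):
                return False
    return True
```

Since reversal is its own inverse, the single equivalence tested here covers both directions of the proposition.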


Proposition 10.2 


If S₁ and S₂ are right-balanced then S₁ S₂ is right-balanced.
Similarly if S₁ and S₂ are left-balanced then S₁ S₂ is
left-balanced.


Proof: Any suffix of S₁ S₂ is either a suffix of S₂, in which
case its spacing is nonnegative, or is of the form F S₂ where F
is a suffix of S₁. But the spacing of F S₂ = spacing of F +
spacing of S₂ and hence is also nonnegative. Hence S₁ S₂ is
right-balanced. In a similar way, if S₁ and S₂ are
left-balanced, S₁ S₂ is left-balanced.


Proposition 10.3 


Every suffix of a right-balanced string is right-balanced. 
Similarly every prefix of a left-balanced string is left- 
balanced. 


Proof: is obvious. 


An algorithm to B-normalize a string S containing no  USCORE's 
is given below: 


(i) Reverse S 


(ii) Apply the following transformation repeatedly until it 
can no longer be applied. 


     S  NOTANY(B) . X  B  B  ONE_POS . Y  =  B  Y  X  B


(where B is the BSPACE character and where ONE_POS is a 
pattern which will match the shortest string whose 
spacing is 1). 


(iii) Remove initial BSPACE's from S. 


(iv) Test for double BSPACE or trailing BSPACE. If yes to 
either question, the original string was not left- 
balanced, respond appropriately. Otherwise return the 
reverse of S. 


To illustrate the algorithm, let S be the string
'abcd←←←←efgh'. By step (i) it is reversed to form
'hgfe←←←←dcba'. Step (ii) is a multistepped process
illustrated in Figure 10.1, yielding the string shown. Step
(iii) does nothing. Step (iv) reverses the string to return
'a←eb←fc←gd←h' which is the result sought.


Step (ii) is the heart of the algorithm and does the fol- 
lowing. The spacing of (B B Y) is -1. Hence the position 
number of X is higher than the position number of all charac- 
ters in Y. Since in B-normalization the position numbers must 
be in ascending sequence, the X and the Y are interchanged. 
It is for this reason too that the transformation of (ii) must 
terminate since there are only a finite number of inversions 
in the original string. 


Will we be able to reverse all inversions? In order to have
an inversion we must have at least one double BSPACE. If the
double BSPACE is not removed by (ii) then it either is at the
beginning, in which case it is removed by (iii), or the sequence

     NOTANY(B) B B

occurs in S but is not followed by ONE_POS. This implies that
S is not balanced on the right; the transformation indicated
in (ii) preserves right balancing (the proof of which is left
as an exercise) so this implies that the original reversed
string was not right-balanced. This implies by Proposition
10.1 that the original string S was not left-balanced.
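
Steps (i) through (iv) can be traced with a short sketch. The Python below is an illustration under stated assumptions, not a transcription of BNORM: '<' stands in for BSPACE, the input has no USCORE's, ONE_POS is matched by counting positions still owed instead of by recursion, and a string that is not balanced on the left raises an error where BNORM would pad with blanks and retry.

```python
BSPACE = "<"   # stand-in for the backspace character (assumption)


def match_one_pos(s, start):
    """End index of the shortest substring at `start` whose spacing is 1.

    Mirrors ONE_POS = NOTANY(B) | B *ONE_POS *ONE_POS without recursion:
    each BSPACE adds one more position that still has to be matched.
    Returns None if the string runs out first.
    """
    needed = 1
    i = start
    while needed > 0:
        if i >= len(s):
            return None
        needed += 1 if s[i] == BSPACE else -1
        i += 1
    return i


def bnorm_by_rewriting(s):
    """Apply steps (i)-(iv) to a string without USCOREs."""
    s = s[::-1]                                   # (i) reverse
    changed = True
    while changed:                                # (ii) repeat until no match
        changed = False
        for i in range(len(s) - 2):
            if s[i] != BSPACE and s[i + 1] == BSPACE and s[i + 2] == BSPACE:
                end = match_one_pos(s, i + 3)
                if end is not None:
                    x, y = s[i], s[i + 3:end]
                    s = s[:i] + BSPACE + y + x + BSPACE + s[end:]
                    changed = True
                    break                         # rescan from the left
    s = s.lstrip(BSPACE)                          # (iii) drop initial BSPACEs
    if BSPACE * 2 in s or s.endswith(BSPACE):     # (iv) input was left-unbalanced
        raise ValueError("original string was not balanced on the left")
    return s[::-1]
```

On 'abcd<<<<efgh' the reversed string passes through 'hgf<<<dcbe<a', 'hg<<dcf<be<a' and 'h<dg<cf<be<a' before the final reversal.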


The definition of ONE_POS can be given recursively as:

     ONE_POS = NOTANY(B) | B *ONE_POS *ONE_POS

This definition, while 'correct', could prove impractical. Let
us assume that 100 backspaces appear consecutively. Then
ONE_POS will descend to 100 levels before matching. Though
there is no inherent limitation on the number of recursive
levels to which we can plunge, there are often practical
limitations, and this will, in general, depend on the
implementation. Since the limit on the recursive depth has been
known to be less than 100 for some implementations and since
100 consecutive backspaces, while unusually large, is not an
unreasonable quantity, we must seek a solution. We solve our
problem by scanning first for a group of BSPACE's (viz. 5 of
them) and only if the group is not there do we choose to try
the case of one BSPACE. Thus

          ONE_POS  =  NOTANY(B) |
+                     DUPL(B,5) FENCE *FIVE_POS *ONE_POS |
+                     B *ONE_POS *ONE_POS
          FIVE_POS =  ONE_POS ONE_POS ONE_POS ONE_POS ONE_POS

The maximum recursive plunge becomes ⌊k/5⌋ + REMDR(k,5) where
k is the number of consecutive BSPACE's. If recursive levels
of 70 are permitted, we can tolerate k ≤ 338. We can use the
same basic scheme to achieve even longer lengths of
consecutive BSPACE's but 338 should suffice.
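
The figure of 338 can be checked with a little arithmetic. The Python sketch below assumes that the plunge depth for k consecutive BSPACE's is the integer quotient k/5 plus the remainder REMDR(k,5); the function names are hypothetical.

```python
def plunge_depth(k, group=5):
    """Assumed depth for k consecutive BSPACEs: quotient plus remainder."""
    return k // group + k % group


def longest_safe_run(limit, group=5):
    """Largest K such that every run of 1..K BSPACEs stays within `limit` levels."""
    k = 0
    while plunge_depth(k + 1, group) <= limit:
        k += 1
    return k
```

Under this assumption a run of 338 needs 67 + 3 = 70 levels, while a run of 339 needs 67 + 4 = 71, so 338 is the longest run for which all shorter runs are also safe at a limit of 70.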


Note the effect of FENCE. If it were not there our clever
scheme would be thwarted if a long sequence of BSPACE's
appeared in a string which was unbalanced on the left. The
reason is that, as we have discussed earlier, the right-most
*ONE_POS will fail. Without the FENCE the alternate
B *ONE_POS *ONE_POS will be tried. We will ultimately recurse
as many levels as there are BSPACE's, only it will take longer.


*  BNORM(S) will return the B-normalization of the string S.
*  Blanks will be prepended to S if it is not balanced on the
*  left.
*
          DEFINE('BNORM(S)B,S1,S2,X,Y,P')
*
*  Initialize patterns
*
          ONE_POS   =  NOTANY(BSPACE)
+                   |  DUPL(BSPACE,5) FENCE *FIVE_POS *ONE_POS
+                   |  BSPACE *ONE_POS *ONE_POS
          FIVE_POS  =  ONE_POS ONE_POS ONE_POS ONE_POS ONE_POS
          IF_BSPACE =  BREAK(BSPACE)              :(BNORM_END)
*
*  Entry point: First make a quick scan to see if any
*  backspace character exists in S. If none such, return
*  immediately.
*
BNORM     S IF_BSPACE                             :S(BNORM_1)
          BNORM = S                               :(RETURN)
*
*  Are there any USCORE's? If so, subdivide and recurse.
*
BNORM_1   S BREAK(USCORE) . S1 USCORE REM . S2    :F(BNORM_B)
          BNORM = BNORM(S1) USCORE BNORM(S2)      :(RETURN)
*
*  Reverse the string and apply the transformation described
*  in the text.
*
BNORM_B   S = REVERSE(S)
          B = BSPACE
          P = NOTANY(B) . X B B ONE_POS . Y
BNORM_2   S P = B Y X B                           :S(BNORM_2)
*
*  The transformation has been applied as far as it will go.
*  Remove leading BSPACE's.
*
          S POS(0) SPAN(B) =
*
*  If a double BSPACE or trailing BSPACE remains, add a blank
*  to S and try again. Otherwise reverse and return.
*
          S B B                                   :S(BNORM_UNB)
          BNORM = REVERSE(S)
          BNORM POS(0) B                          :F(RETURN)
BNORM_UNB S = S ' '                               :(BNORM_2)
BNORM_END

Names referenced    Name        Type         Where defined
by BNORM:           REVERSE     Function     Program 3.6
                    BSPACE *    Character
                    USCORE      Character


* indicates name is referenced in the initialization section. 


Epilogue 


BNORM was written under the assumption that most paragraphs do 
not contain USCORE's or BSPACE's. Such paragraphs are handled 
as efficiently as possible. Other paragraphs are not treated 
as quickly as could be done. Specifically, patterns are not 
predefined where they could be. The scanning for the pattern 
P could be replaced by a more elaborate process so that double 
BSPACE would be found rapidly via BREAKX. Similarly, the
double BSPACE check at the end could also be done more rapidly 
using BREAKX. Another improvement might be to handle the spe- 
cial case of 


n-nonBSPACE's n-BSPACE's n-nonBSPACE's 


by a variant of the BLEND operation. But such sequences are 
likely to be used in the case of underscoring so that the pre- 
processor would be expected to catch this special case. 


Given our assumptions, however, none of these changes seem 
warranted, since, for seldom used code, we want to be guided 
more by the desire to save program space (which is also worth 
money) than execution time. If the ground rules change, 
rewriting according to the above principles may be indicated. 


Note that if S is not left-balanced, BNORM(S) returns a 
balanced string which is similar to S. An alternate approach 
would be to have BNORM fail. In the latter case, however, the 
calling subroutine would have to specify recovery operations. 
This can become a continuing nuisance and can be all the more 
irritating because it involves a case which probably will 
never occur. 
Program 10.2: INORM

Image Normalization, or I-normalization, is the process of
converting a string having a given printed image into a unique
representation for that image. Thus, the strings 'O←/' and
'/←O', when printed, will have identical printed images, viz.
'Ø'. Also, the image produced by 'X← ' is the same as the image
produced by simply 'X', implying that overstruck blanks may be
dropped in I-normalization. The reason for I-normal form is to
be able to determine equality of printed images based on the
characters used to produce the images.
In addition, we would also like to scan a string which
produces an image to determine whether a subimage appears
within it. For example, suppose, in a time-sharing system, a
programmer had typed in the phrase:

     "... such a string is called a convoluted rope."


and he wishes to change something in the string. Most time- 
sharing systems have editors in which one can specify a sub- 
string to be searched for and a replacement to be made, so 
that the user could say in effect 


     change 'rope' to 'string'


Assuming that USCORE is not being used and that no normaliza- 
tion exists, the above substitution request could result in 
the string 


=> A Gap eee ce oe 


Since 'rope' has fewer characters than 'string', the
underlining is no longer correct. To compensate, we may request
the editor to

     change 'rope←←←←____' to 'string←←←←←←______'

We may obtain the desired result, but then again we may not.
If, in the original, we had typed 'rope' before underscoring
'convoluted' this particular string sequence would not be
found. Moreover, if we had typed the period before
underscoring 'rope' we also could not make the indicated
replacement. If, in the latter case, we made so simple a
request as

     change '.' to '!'


we might obtain 


This state of affairs can be quite frustrating, especially 
when repeated attempts to make replacements result in failure. 
Image normalization will permit us to escape from this 
malaise. 
Earlier we mentioned that B-normalization is a necessary
prelude to I-normalization. That this is true is a derivable
result.


By an image we mean a configuration of printing on paper, 1
character high and 0 or more characters wide. We may speak of
concatenating images just as we concatenate strings. Let the
image I be produced by each of the set of strings S₁, S₂, ...
where the sequence goes on indefinitely because there is no
limit to the number of backspaced blanks that can be added
without changing the image. Let N(S) be the function which
converts a string to its I-normal form. If N(S) is working as
it should then N(S₁), N(S₂), ... will all produce the same
string. Hence we can meaningfully speak of N(I) where I is an
image. The value of N(I) will be N(S) where S is any of the
strings which produce I. If, for example, N('O←/') happens to
be '/←O', we may say that N('Ø') equals '/←O'.


Our intended purpose is to be able to scan a given image I for 
a subimage I' by scanning N(I) for N(I'). This implies that 


     N(I₁ I₂) = N(I₁) N(I₂)


that is, the function must be homomorphic (with respect to 
concatenation of images). This is important because it means 
that the function N() is completely specified by a knowledge 
of N(I) where I ranges through all single print-position 
images. (See Chapter 3 for a further discussion of homomorphic 
functions.) 


The notion of normal form implies that the thing considered
'normal' is actually a member of the class it represents. That
is, if S₁, S₂, ... is the set of strings corresponding to
image I then

     N(I) = Sₙ

for some n. If, moreover, we make the normal form irredundant
in the sense that no characters can be removed without
changing the image, we are left with the conclusion that the
normal form of, for example, an underscored A can either be
'A←_' or '_←A', but nothing else. Hence, the mapping of a
single position must be of the form

     c₁←c₂← ... ←cₙ

where n ≥ 1. This observation coupled with the fact that N()
must be homomorphic implies that a string in I-normal form
must also be in B-normal form.

The order of striking is unimportant in the final image 
produced. For example can the reader determine which character 
struck first in the set of overstrikes below? 


sgg 




The answer (although not obvious) is that the slash appeared 
first at positions 1, 2 and 4. 


The question of which images are distinguishable is an impor- 
tant one but, unfortunately, is one which depends on the 
equipment used and, to a certain extent, on the discriminating 
powers of the individual. Will, for example, a character
overstruck with itself produce a different image than if it
were not so overstruck? Is, for example, 'A←A' different from
'A'? We will hold that it is and that use can be made of the
resulting boldface. However, not all media are like printers
in this respect. The all-or-none characteristic of cathode
ray displays may prohibit this assumption. Also, some
time-shared editors (e.g., Saltzer [1964]) have been known to
normalize away boldface.


Another source of ambiguity is that different overstruck  com- 
binations can resemble each other. For example 


t t T 
were produced respectively by the combinations 


tjer! a UE 
Though they can be distinguished when compared, they may not 
be so distinguishable if viewed in isolation. 


Another issue is the non-printable character. As mentioned 
earlier (Chapter 2), most of the 256 EBCDIC characters are 
non-printing. To be consistent with the previous notions of
image identity, each of these should be converted to blank. 
This we will not do for 2 reasons. Experience has shown that 
use can be made of a character that prints blank but which 
really isn't a blank for the purpose of line breaking and pad- 
ding (so-called hard blanks). Also, the notion of nonprinting 
character is device dependent. The subscripts (such as ',') 
are non-printing on most printers (and most devices) but 
should not be converted to blank each time they appear in 
text. A program is usually not dedicated to a particular 
device and in fact may be in simultaneous communication with 
2 different devices. In such cases, the notion of non-printing
character loses its significance.


As a result of these considerations, we will assume a string
S₁ of overstruck characters can be distinguished from a string
S₂ if and only if

     ORDER(DIFF(S₁,' ')) ≠ ORDER(DIFF(S₂,' '))


(See Progs. 3.10 and 3.1). This leads to the following defini- 
tion. A string is in I-normal form if 


(1) it is in B-normal form, and

(2) for every sequence of the form

     c₁←c₂← ... ←cₙ

where n > 1, the characters are in alphabetic order and contain
no blanks.
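
The definition can be illustrated with a compact sketch. The Python below is not Program 10.2: it assumes '<' as a stand-in for BSPACE, handles only strings without USCORE's, groups characters by print position directly instead of calling BNORM, and orders overstruck characters by Python's code-point order, which need not match the collating sequence behind the book's ORDER.

```python
BSPACE = "<"   # stand-in for the backspace character (assumption)


def inorm(s):
    """I-normalize a USCORE-free string that is balanced on the left."""
    columns = []   # columns[p] lists the characters struck at position p + 1
    pos = 0
    for ch in s:
        if ch == BSPACE:
            pos -= 1
            if pos < 0:
                raise ValueError("string is not balanced on the left")
        else:
            if pos == len(columns):
                columns.append([])
            columns[pos].append(ch)
            pos += 1
    pieces = []
    for col in columns:
        struck = sorted(c for c in col if c != " ")   # drop blanks, then order
        # A position holding only blanks is kept as a single blank.
        pieces.append(BSPACE.join(struck) if struck else " ")
    return "".join(pieces)
```

Duplicates are kept, so a character overstruck with itself stays distinct from the single character, in line with the boldface convention adopted in the text.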


A string can be I-normalized by placing it in B-normal form, 
removing overstruck blanks, and alphabetizing overstruck 
characters as is shown below. 


*  INORM(S) will return the Image Normalization of the string
*  S.
*
          DEFINE('INORM(S)C,CC,S1,K')
*
*  Initialize patterns. PR_POS will find a print position
*  containing backspaces.
*
          PR_POS = POS(0) ARB . S1 (LEN(1) BSPACE LEN(1)
+            ARBNO(BSPACE LEN(1))) . CC (NOTANY(BSPACE) | RPOS(0)) . C
+                                                 :(INORM_END)
*
*  Entry Point: If no BSPACE's are present, return im-
*  mediately. Otherwise B-normalize S before going further.
*
INORM     S IF_BSPACE                             :F(INORM_RET)
          S = BNORM(S)
*
*  Look for a print position involving BSPACE. If none are
*  left, return. Otherwise, ORDER the overstruck characters.
*
INORM_LOOP
          S PR_POS = C                            :F(INORM_RET)
          CC = DIFF(CC,BSPACE ' ')
          CC = IDENT(CC,NULL) ' '
          CC = BLEND(ORDER(CC), DUPL(BSPACE,SIZE(CC) - 1))
          INORM = INORM S1 CC                     :(INORM_LOOP)
*
*  Common return point.
*
INORM_RET INORM = INORM S                         :(RETURN)
INORM_END


Names referenced     Name         Type        Where defined
by INORM:            BNORM        Function    Program 10.1
                     IF_BSPACE    Pattern     Program 10.1
                     ORDER        Function    Program 3.1
                     BLEND        Function    Program 3.7
                     DIFF         Function    Program 3.10
                     BSPACE *     Character


* indicates name is referenced in the initialization section.


Epilogue


Here, as in BNORM, we adopt the view that while it is essen- 
tial to handle the case of no backspace characters rapidly, we 
can take our time with strings in which they are present. In 
particular, if no special characters exist in the argument S, 
control passes to INORM_RET where an exit is made. It seems 
as if an unnecessary concatenation is performed at INORM_RET 
but the system is smart enough to return the other argument if
one of them is null.


If the assumption that BSPACE's are rare is invalid, there are
several ways of increasing the speed of INORM. One method would be to
rewrite PR_POS so that BREAK is used rather than ARB to search 
for a BSPACE. The writing of PR_POS is complicated by the fact 
that BREAK carries one further than where one might like to be 
but this can be handled by failing and alternating. See Exer- 
cise 8.5. 


Another method of speedup relies on the fact that the great
majority of overstruck positions have only 2 characters at
that position. Handling this as a special case can avoid
the call to ORDER most of the time.
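
The normalization rule that INORM applies can be restated compactly outside SNOBOL4. The following is a minimal Python sketch, not the book's code: it assumes the input is already B-normalized (each overstruck print position has the form C1 BSPACE C2 ... BSPACE Cn), and it uses Python's code-point ordering where the book calls ORDER. The name `inorm` is chosen here for illustration.

```python
import re

BSPACE = "\b"

# One print position holding two or more overstruck characters,
# assuming B-normal form: C1 BSPACE C2 ... BSPACE Cn.
_OVERSTRUCK = re.compile(r".(?:\x08.)+", re.DOTALL)


def inorm(s):
    """Drop overstruck blanks and sort the characters at each position."""
    if BSPACE not in s:            # fast path: nothing is overstruck
        return s

    def fix(match):
        chars = sorted(c for c in match.group(0) if c not in (BSPACE, " "))
        if not chars:              # only blanks were overstruck
            return " "
        return BSPACE.join(chars)

    return _OVERSTRUCK.sub(fix, s)
```

With this sketch, `inorm("BRO\b/WN")` and `inorm("BR/\bOWN")` give the same string, which is the point of a normal form: two strings that print the same image compare equal.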


Program 10.3   LINE

Given a paragraph stored as one long string, we will need a
function to separate the paragraph into lines. LINE(CW) will
return the next cluster of words which will just fit within a
column width of size CW. To initialize LINE a call is made to
LINE_INIT(P) where P is the paragraph to be decomposed. When
LINE(CW) fails no more characters remain. Thus

         LINE_INIT('A QUICK BROWN FOX JUMPED OVER THE LAZY DOG.')
L        OUTPUT  =  "'"  LINE(10)  "'"              :S(L)

will print

     'A QUICK'
     'BROWN FOX'
     'JUMPED'
     'OVER THE'
     'LAZY DOG.'


If the global variable JUSTIFY is given the value 1 then the 
right margin is justified. Thus if 


JUSTIFY = 1 


had been executed prior to the calls to LINE(10) the values
printed would have been:

     'A    QUICK'
     'BROWN  FOX'
     'JUMPED'
     'OVER   THE'
     'LAZY DOG.'


Here, JUSTIFY serves as a switch and follows the same conven-
tions as SNOBOL4 keyword switches (i.e. an integer not equal
to 0 is on; an integer equal to 0 or null is off). No attempt
is made to justify the last line or a line in which no spaces 
appear. 


In general, justifying text of small line widths suffers from 
the possibility of words exceeding the column width and single 
word-lines (such as 'JUMPED') not meeting it. These ill ef- 
fects diminish in significance as the column width increases. 
Hyphenation (Program 10.7) also helps in this regard to 
produce a document with less white area. 


Breaking a line at a suitable break point must seem like sheer
simplicity. If the column width is CW, then go out to that
position + 1 and start marching backward until a blank is
found. This should be our breakpoint. But this doesn't always
work for several reasons. It won't work if we allow the pos-
sibility of USCORE's and BSPACE's. Consider the example (in
which _ stands for USCORE and ← for BSPACE)


     'A _QUICK BRO←/WN_ FO←/X'


If the column width is 15, the first 3 words will easily fit 
within a column, but the above algorithm will pick up only the 
first two. This is because the spacing of a string may be less 
than its size. 


Another reason that we cannot use the simple algorithm is that 
a string may be reduced in size by contracting certain sub- 
strings such as converting double blanks to single blanks. 
Such a condensation will, in general, be preferable to ad-
ding a large number of blanks into the line. In order that
this technique be effective we must include in our considera- 
tion enough of the paragraph in order to take advantage of any 
conceivable condensation. 


A third reason has to do with hyphenation. Hyphenation al- 
gorithms are not very good unless the entire word to be 
hyphenated is available. 


In all of these cases we need to have sufficient context in
order to make an intelligent decision as to how to break a
line.


Another difficulty has to do with the assumption that all
blanks separate words. Consider the string (← stands for
BSPACE)


     'A QUICK BROW←←/ N FOX'


Here a blank is used to get over the 'W' and not to end a
word. But we may convert the string to B-normal form to obtain

     'A QUICK BRO←/W← N FOX'


From any string we may safely remove either of the combina-
tions '← ' or ' ←' without changing the image printed.
Moreover, by making such deletions from the B-normal form we
will remove all overstruck blanks. Any remaining blanks will
be regarded as true word separators.


There are cases when a user does not wish to have a blank 
treated as a word separator. (There are some examples of this 
in the preceding paragraph.) In such instances the user of 
the system may inject into his text so-called hard blanks.
A hard blank is any nonprintable character other than blank. As an
example, the 0-8-2 punch provides the 029 keypunch user with 
such a hard blank. For input devices which do not have a spe- 
cial key for this purpose, the system can provide a special 
character which will be appropriately converted. 


The contractions which should be permitted in a line of text
will vary with the application, taste and perhaps with the 
column width. Almost certainly, we should be permitted the 
freedom to convert the two blanks which normally separate 
sentences into one blank. Often we may condense strings of
the form

     punctuation-mark blank

by removing the blank. For example

     'A quick, brown, angry fox ...'

could also be rendered

     'A quick, brown,angry fox'

We can associate with each string S a minimum printing width
MINP(S) which is equal to SPACING(S') where S' equals S after
all allowable contractions have been made. Then

     MINP(S) ≤ SPACING(S) ≤ SIZE(S)

We define a natural break point as the SIZE of a prefix which
ends in a nonblank which immediately precedes a blank. Thus,
the natural break points of

     'A _quick, brown, angry fox_ jumped ...'

(where _ stands for USCORE) are


     1   9   16   22   27   34  ...


Associated with each breakpoint is a spacing. For the above
example, the spacings are:


     1   8   15   21   25   32  ...
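
The two definitions can be checked mechanically. The following is a small Python sketch rather than the book's SNOBOL4; it assumes the character `_` as a stand-in for USCORE, and it takes the spacing rule from SPACING (Program 10.5): the size less 2 for each BSPACE and 1 for each USCORE.

```python
BSPACE = "\b"
USCORE = "_"   # assumption: stand-in for the book's USCORE character


def spacing(s):
    """Print positions occupied by s (B-normalized input assumed)."""
    return len(s) - 2 * s.count(BSPACE) - s.count(USCORE)


def natural_break_points(s):
    """Sizes of the prefixes that end in a nonblank followed by a blank."""
    return [i for i in range(1, len(s)) if s[i - 1] != " " and s[i] == " "]


text = "A _quick, brown, angry fox_ jumped ..."
points = natural_break_points(text)
widths = [spacing(text[:b]) for b in points]
print(points)   # sizes of the candidate prefixes
print(widths)   # spacing of each candidate prefix
```

Each spacing falls short of its break point by the number of USCORE characters the prefix contains, since those occupy no print position.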
Clearly, if a spacing exists such that it exactly equals CW,
there is no problem. Sufficient context is defined as the
break-point associated with the smallest spacing equal to or
greater than CW. Denote this break-point B2 and denote its
predecessor B1. Denote the associated spacings (or widths) W1
and W2. Then


     W1 < CW ≤ W2

Denote the associated prefixes X1 and X2. Then


     SIZE(X1) = B1
     SIZE(X2) = B2


Without hyphenation we have 2 choices, either to expand X1 by
inserting blanks or to squeeze X2. We will assume that the
aesthetic liability (termed Ugly Factor (UF) in the program)
associated with inserting a blank is equal to that associated
with removing a blank (exercises will explore other less sim-
plistic possibilities). Hence we seek the minimum of


     W2 - CW   and   CW - W1


Of course, if it is not physically possible to shrink X2 to
size, we must use X1.


If hyphenation is available, we consider each hyphenation
point in turn and seek to minimize the contraction or expan-
sion necessary. Also we add an additional cost (of 1) for the
aesthetic loss due to hyphenation.


The algorithm to obtain sufficient context (B2) is simply to
look at break-points at CW, CW+1, CW+2, etc. and keep looping
until a spacing is found greater than or equal to CW. Since
the spacing is less than or equal to the break-point, no
break-point below CW is needed. To find a break-point at CW,
however, it is necessary to look for blanks beginning at CW-1.
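
The choice between expanding X1 and squeezing X2 can be sketched as follows. This is a Python illustration under stated assumptions, not the LINE program itself: hyphenation is left out, the only contraction allowed is a double blank reduced to a single blank, `_` stands in for USCORE, break points are scanned from the start of the string instead of from CW-1, and a single word wider than CW is simply given a line of its own. The function names are chosen here for illustration.

```python
BSPACE = "\b"
USCORE = "_"   # assumption: stand-in for the book's USCORE character


def spacing(s):
    return len(s) - 2 * s.count(BSPACE) - s.count(USCORE)


def minp(s):
    # Assumes the only contraction is a double blank becoming one blank.
    return spacing(s) - s.count("  ")


def choose_break(p, cw):
    """Return the break point B for width cw, or None if all of p fits."""
    points = [i for i in range(1, len(p)) if p[i - 1] != " " and p[i] == " "]
    b1 = 0
    for b2 in points:
        w2 = spacing(p[:b2])
        if w2 >= cw:
            break                    # b2 is the sufficient context
        b1 = b2
    else:
        return None                  # no spacing reaches cw
    if w2 == cw or b1 == 0:
        return b2                    # exact fit, or one overlong word
    expand = cw - spacing(p[:b1])    # blanks to insert into X1
    squeeze = w2 - cw                # blanks to remove from X2
    if minp(p[:b2]) <= cw and squeeze < expand:
        return b2
    return b1


def lines(p, cw):
    p = p.rstrip() + " "             # trailing blank, as LINE_INIT appends
    out = []
    while p.strip():
        b = choose_break(p, cw)
        if b is None:
            out.append(p.rstrip())
            break
        out.append(p[:b])
        p = p[b:].lstrip(" ")
    return out
```

Called as `lines('A QUICK BROWN FOX JUMPED OVER THE LAZY DOG.', 10)`, the sketch yields the same five unjustified lines as the example given for LINE(10). Squeezing wins only when it is strictly cheaper than expanding, which mirrors the test on UF1 against UF in the program.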


*  LINE(CW) will return the next line of a paragraph passed
*  to LINE_INIT().  The column width is CW characters.  LINE
*  will fail when no more lines remain.  If HYPHENATE is non-
*  zero, words will be hyphenated.  If JUSTIFY is nonzero the
*  lines will be right-justified (padded with blanks).
*
         DEFINE('LINE(CW)B,B2,TRY,X2,W,W2,T,RWORD,UF,UF1,'
+           'K,H,HYPHEN')

         HYPHENATE  =  0
         JUSTIFY    =  0


         DEFINE('LINE_INIT(P)T')
         &ALPHABET  LEN(1) . HARD_BLANK             :(LINE_INIT_END)
*
*  Entry point for initialization:  B-normalize the paragraph
*  and remove any overstruck blanks from P.
*
LINE_INIT  P  IF_BSPACE                             :F(LINE_I1)
         P  =  BNORM(P)
LINE_I2  P  BSPACE ' '  =                           :S(LINE_I2)
LINE_I3  P  ' ' BSPACE  =                           :S(LINE_I3)
*
*  Replace leading blanks (if any) by 'hard blanks' (i.e.
*  blanks not subject to reduction or expansion).  Append a
*  blank to make scanning easier.  (U_SAVED contains an under-
*  score if there was an unterminated underscoring left over
*  from the last line.)
*
LINE_I1  P  POS(0)  SPAN(' ')  @T  =  DUPL(HARD_BLANK,T)
         P_SAVED  =  P  ' '
         U_SAVED  =                                 :(RETURN)
LINE_INIT_END
*
*  Initialize patterns for LINE.
*
         SUFFICIENT_CONTEXT.X2  =  (LEN(*TRY)  BREAK(' ')) . X2
+                                  @B2  SPAN(' ')  @TRY
         FIND.RWORD.T  =  @T  BREAK(' ') . RWORD  SPAN(' ')  @T
         EXTRACT.LINE  =  LEN(*B) . LINE  (SPAN(' ') | NULL)
         IF_USCORE  =  BREAK(USCORE)                :(LINE_END)
*
*  Entry point proper:  Obtain sufficient context (B2, X2).
*  If a sufficient context does not exist, go to LINE_SMALL.
*  Keep looping back until a sufficient context is obtained
*  or is determined not to exist.  If the spacing, W2, exactly
*  equals CW, this is the desired breakpoint, B.
*
LINE     TRY  =  CW - 1
LINE_1   P_SAVED  SUFFICIENT_CONTEXT.X2             :F(LINE_SMALL)
         W2  =  SPACING(X2)
         GE(W2,CW)                                  :F(LINE_1)
         B  =  EQ(W2,CW)  B2                        :S(LINE_2)
*
*  Find the last word RWORD in reversed form from X2.  From
*  the breakpoint T, compute a tentative breakpoint B (this
*  is actually B1) and a tentative ugly factor UF (the amount
*  by which X2 must be expanded).
*
         REVERSE(X2)  FIND.RWORD.T
         B  =  B2 - T
         UF  =  CW - SPACING(SUBSTR(X2,1,B))
*
*  Starting with no hyphenation (K=0) and looping for
*  increasing degrees of hyphenation, determine a) if the
*  line will fit and b) if the cost of padding plus hyphena-
*  tion (UF1) is less than the lowest so far achieved.  W is
*  the spacing of the reduced line.
*
         K  =  0
LINE_3   LE(MINP(X2) - K + SIZE(HYPHEN), CW)        :F(LINE_4)
         W  =  W2 - K + SIZE(HYPHEN)
         UF1  =  CW - W
         UF1  =  LT(UF1,0)  -UF1
         UF1  =  UF1 + SIZE(HYPHEN)
         GE(UF1,UF)                                 :S(LINE_4)
         B  =  B2 - K
         UF  =  UF1
         H  =  HYPHEN
LINE_4   K  =  NE(HYPHENATE,0)  HYPHENATE(RWORD,K + 1)
+                                                   :S(LINE_3)
*
*  Enter here with B set to break point and with H set to
*  null or '-'.
*
LINE_2   P_SAVED  EXTRACT.LINE  =
         LINE  =  LINE  H
         LINE  =  NE(JUSTIFY,0)  PAD(LINE,CW)
*
*  If an odd number of USCORE characters appear in LINE, set
*  the value of U_SAVED to USCORE to be tacked onto the next
*  line.
*
LINE_USCORE
         LINE  =  U_SAVED  LINE
         LINE  IF_USCORE                            :F(RETURN)
         U_SAVED  =  DUPL(USCORE, REMDR(COUNT(LINE,USCORE),2))
         LINE  =  LINE  U_SAVED                     :(RETURN)
*
*  Entering here means that whatever remains is small enough
*  to fit in a line.  If nothing remains, FAIL.
*
LINE_SMALL
         IDENT(P_SAVED,NULL)                        :S(FRETURN)
         LINE  =  TRIM(P_SAVED)
         P_SAVED  =                                 :(LINE_USCORE)
LINE_END
Names referenced     Name         Type        Where defined
by LINE:             REVERSE      Function    Program 3.6
                     PAD          Function    Program 10.4
                     SUBSTR       Function    Program 3.9
                     MINP         Function    Program 10.6
                     BNORM        Function    Program 10.1
                     IF_BSPACE    Pattern     Program 10.1
                     HYPHENATE    Function    Program 10.7
                     USCORE *     Character
                     BSPACE       Character


* indicates name is referenced in the initialization section.


Program 10.4   PAD

PAD(S,CW) will add or delete blanks from the string S as
necessary to adjust the spacing of S to equal CW. When blanks
are added they are not always added from the same direction.
Otherwise the process would tend to produce more white area on
one side as opposed to the other. White areas running
vertically down the page are termed rivers and large bodies of
white areas are termed lakes. It is good formatting practice
to prevent rivers and lakes from forming.


The writing of PAD is greatly simplified by the assumption
that S is B-normalized and contains no overstruck blanks (a
fact assured by the activity in LINE_INIT). This implies that
every blank separates 2 balanced substrings and so blanks may
be inserted without causing misalignment of overstruck
characters.


*  PAD(S,CW) will add or delete blanks to the string S to
*  make it conform to a column width of CW.
*
         DEFINE('PAD(S,CW)I,K,T,N')
*
*  This pattern looks for the first blank which is not in a
*  sequence of initial blanks.
*
         INTERIOR_BK  =  ((SPAN(' ') | NULL)  FENCE  BREAK(' ')) . T
+                                                   :(PAD_END)
*
*  Entry point:  Determine the number of blanks (N) to be ad-
*  ded.  Branch to PAD_REDUCE if N < 0.
*
PAD      N  =  CW - SPACING(S)
         PAD  =  LE(N,0)  S                         :S(PAD_REDUCE)


*
*  First insert a blank at a statement separator if any
*
         S  '  '  =  '   '                          :F(PAD_1)
         N  =  N - 1
         PAD  =  EQ(N,0)  S                         :S(RETURN)


*
*  PAD_RT is a flag to indicate whether padding should begin
*  from the right (=1) or from the left (=0).
*
PAD_1    S  =  EQ(PAD_RT,1)  REVERSE(S)
*
*  Inner loop:  Remove a prefix from S at an internal blank.
*  Place it onto PAD with an extra blank.  Keep looping until
*  N is reduced to 0.
*
PAD_LOOP  S  INTERIOR_BK  =                         :F(PAD_AGAIN)
         PAD  =  PAD  T  ' '
         N  =  N - 1  GT(N,1)                       :S(PAD_LOOP)
*
*  Falling through indicates completion.  Append S; reverse
*  if necessary; change flag for next time; and return.
*
PAD_DONE
         PAD  =  PAD  S
         PAD  =  EQ(PAD_RT,1)  REVERSE(PAD)
         PAD_RT  =  1 - PAD_RT                      :(RETURN)
*
*  Here if no more holes remain.  If PAD is null at this point
*  return; there are no holes.  Otherwise restore PAD and S.
*
PAD_AGAIN  IDENT(PAD)                               :S(PAD_DONE)
         S  =  PAD  S
         PAD  =                                     :(PAD_LOOP)


*
*  Here to remove N characters.
*
PAD_REDUCE  N  =  LT(N,0)  N + 1                    :F(RETURN)
         PAD  '  '  =  ' '                          :(PAD_REDUCE)
PAD_END
Names referenced     Name         Type        Where defined
by PAD:              SPACING      Function    Program 10.5
                     REVERSE      Function    Program 3.6


Epilogue


The design of PAD was based on the assumption that N is small 
compared with the size of S and indeed that N does not usually 
exceed the number of blanks in S. If this were not the case 
then a more efficient procedure would be to make one pass 
through to determine the number of blanks in S, compute the 
number of blanks to be inserted and, in this way, accomplish 
the insertion in 2 passes. 


The method given saves the initial pass of counting the number
of blanks in S and is very much more efficient when 0, 1 or 2
blanks are to be inserted in S.
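
The alternating direction of padding can be sketched in a few lines of Python. This is an illustration under assumptions, not a transcription of PAD: the string is taken to contain no BSPACE or USCORE (so spacing equals length), the preliminary insertion at a sentence separator and the reduction branch are left out, and the class name `Padder` and its `from_right` flag (playing the role of PAD_RT) are invented here.

```python
import re

# Leading blanks (if any) plus one word, stopping just before the next
# blank.  Requiring at least one nonblank keeps the match from backing
# up into the leading blanks, as FENCE does in INTERIOR_BK.
_INTERIOR = re.compile(r" *[^ ]+(?= )")


class Padder:
    def __init__(self):
        self.from_right = False           # plays the role of PAD_RT

    def pad(self, s, cw):
        n = cw - len(s)                   # blanks still to be inserted
        if n <= 0:
            return s                      # reduction is not sketched here
        if self.from_right:
            s = s[::-1]
        done = ""
        while n > 0:
            m = _INTERIOR.match(s)
            if m is None:
                if done == "":
                    break                 # no interior blanks at all
                s, done = done + s, ""    # start another pass
                continue
            done += m.group(0) + " "      # widen this gap by one blank
            s = s[m.end():]
            n -= 1
        result = done + s
        if self.from_right:
            result = result[::-1]
        self.from_right = not self.from_right
        return result
```

Two successive calls with the same arguments place the extra blank at opposite ends, which is what keeps rivers from forming down one side of the column.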


Program 10.5   SPACING

SPACING(S) will determine the spacing of the string S. If S
has been B-normalized this will yield the number of print
positions occupied by the string.


*  SPACING(S) will return the spacing of the string S.
*
         DEFINE('SPACING(S)')
         IF_OVERSTRIKE  =  BREAK(BSPACE USCORE)     :(SPACING_END)
*
*  If no special characters exist, just return the number of
*  characters in S.
*
SPACING  SPACING  =  SIZE(S)
         S  IF_OVERSTRIKE                           :F(RETURN)
*
*  Otherwise deduct 2 for each backspace and one for each
*  underscore.
*
         SPACING  =  SPACING - 2 * COUNT(S,BSPACE)
+                    - COUNT(S,USCORE)              :(RETURN)
SPACING_END
Names referenced     Name         Type        Where defined
by SPACING:          COUNT        Function    Program 3.4
                     BSPACE *     Character
                     USCORE *     Character


* indicates name is referenced in the initialization section.


Epilogue


The two calls to COUNT do not render the most efficient coding
but the convenience and the fact that overstrike characters
are relatively rare suggest their use.


Program 10.6   MINP

MINP(S) will return the minimum number of print positions
needed to print the string S.


         DEFINE('MINP(S)T')                         :(MINP_END)
*
*  Entry point: if JUSTIFY is 0, the contraction points are
*  ignored.  Just return SPACING in this case.
*
MINP     MINP  =  SPACING(S)
         EQ(JUSTIFY,0)                              :S(RETURN)
*
*  Reduce MINP by one for each contraction point found.
*
         MINP  =  MINP - COUNT(S,'  ')              :(RETURN)
MINP_END
Names referenced     Name         Type        Where defined
by MINP:             SPACING      Function    Program 10.5
                     COUNT        Function    Program 3.4
                     JUSTIFY      Global Flag
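
The relation MINP(S) ≤ SPACING(S) ≤ SIZE(S) can be illustrated with a short Python sketch. It is not the book's code: `_` is an assumed stand-in for USCORE, and the only contraction point recognized is a pair of adjacent blanks.

```python
BSPACE = "\b"
USCORE = "_"   # assumption: stand-in for the book's USCORE character


def spacing(s):
    """Size, less 2 per backspace and 1 per underscore marker."""
    if BSPACE not in s and USCORE not in s:
        return len(s)              # fast path: no special characters
    return len(s) - 2 * s.count(BSPACE) - s.count(USCORE)


def minp(s, justify=True):
    """Minimum printing width; contractions count only when justifying."""
    width = spacing(s)
    if not justify:
        return width
    return width - s.count("  ")   # one position saved per double blank


s = "_Stop._  Go on, BRO\b/WN fox."
print(minp(s), spacing(s), len(s))
```

For this string the three quantities are all different: one backspace and two markers separate the size from the spacing, and one double blank separates the spacing from the minimum width.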


Program 10.7   HYPHENATE

Hyphenation, while not strictly necessary, serves to eliminate
rivers and lakes in documents with right edge alignment. This
is particularly true with small column widths in which the
same amount of expansion is concentrated in relatively few
gaps. An exact algorithm for hyphenating
words does not exist short of storing large numbers of special 
cases. In the extreme, a complete dictionary could be stored 
but such a massive amount of information would have to be 
placed on secondary storage since it would be uneconomical, if 
not impractical, to store the dictionary in high-speed 
storage. But secondary storage is unsuitable to this problem 
since accesses must be made frequently (almost once per line). 


The algorithm we will present will not depend on dictionary 
methods other than that a relatively small number of suffixes 
must be stored. Its error rate is low but not zero.  For- 
tunately, no great tragedy befalls if an occasional word is 
mishyphenated. In the last analysis it becomes a balance of
aesthetics: how many lakes and rivers are worth how many
mishyphenated words?


Perhaps the simplest published hyphenation algorithm appears 
in Rich and Stone [1965]. The basic method involves examining 
pairs of letters out of context and deciding whether this pair 
is or is not suitable for hyphenation. This algorithm turns 
out to be too weak (not enough break points are discovered) if 
too few letter pairs are permitted, or too erroneous 
(producing a break at a non-syllable boundary) if too many 
letter pairs are dubbed as breakable. Letter pairs do not 
hyphenate uniformly enough to be used as a sole guide for 
hyphenation. 


The program given here is based on an algorithm developed by 
M.R. (Molly) Wagner [1971] for incorporation in a text format- 
ting program called Roff [McIlroy 1971]. Wagner extended Rich 
and Stone's work to include an examination of suffixes before 
looking for letter pairs and also greatly reduced the number 
of letter pairs considered breakable. With these improvements, 
the error rate has been reduced to the neighborhood of 1% and 
the number of hyphenation points found, while far from total,
is nonetheless satisfactory. This book uses the hyphenation
algorithm described, with the proviso that the user can over-
ride the automatic hyphenation of specific words. Very few
overrides were required.


Most hyphenations found are by suffix removal. Three distinct
kinds of suffixes are defined. A hyphenating suffix is one
before which one can hyphenate. For example 'less' and 'ness'
are both hyphenating suffixes. If 'carelessness' is to be
hyphenated with room for only 6 characters the 'ness' is
stripped off first. There are still too many characters and
so the 'less' is stripped off. The word is then hyphenated as
'care-' on one line followed by 'lessness' on the next. An
inhibiting suffix is one which is not hyphenated and,
moreover, upon encountering one, the suffix hunt is given up
and letter-pair (or digram) testing ensues. For example, 'ing'
is an inhibiting suffix. If it is detected as in 'winning'
the suffix is stripped and digram testing begins with the
double-n. This digram is breakable so that the word is
hyphenated 'win-ning'. Also, an inhibiting suffix will ab-
solutely prohibit hyphenating at a point where digrams might
indicate that hyphenation is allowed. Otherwise 'else' might
be hyphenated 'el-se'. A neutral suffix is one which is not
hyphenatable but, unlike the inhibiting suffix, does not
signal the start of digram testing. More suffix removal can
take place. For example 'es' is a neutral suffix. In
'harnesses' the 'es' is stripped and a further suffix search
yields 'ness' as a hyphenating suffix. The word can therefore
be hyphenated as 'har-nesses'.
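
The suffix phase can be illustrated with a deliberately small Python sketch. It is not the HYPHENATE program: only the suffixes named in the prose are listed, an inhibiting suffix (or no suffix) simply ends the search where digram testing would begin, and of the vowel checks only the one on the remaining stem is kept. The names `suffix_break` and `min_tail` are chosen here; `min_tail` corresponds loosely to the MIN argument.

```python
HYPHENATING = ("less", "ness")   # one may hyphenate just before these
NEUTRAL = ("es",)                # stripped; the suffix search continues
VOWELS = "aeiou"


def _match(stem, suffixes):
    """Longest listed suffix that ends stem, or None."""
    for suf in sorted(suffixes, key=len, reverse=True):
        if stem.endswith(suf):
            return suf
    return None


def suffix_break(word, min_tail=0):
    """Number of trailing characters to carry to the next line, or None."""
    stem, k = word, 0
    while True:
        suf = _match(stem, HYPHENATING)
        if suf:
            stem, k = stem[:-len(suf)], k + len(suf)
            if k > min_tail:
                # insist on a vowel before the break
                return k if any(c in VOWELS for c in stem) else None
            continue
        suf = _match(stem, NEUTRAL)
        if suf:
            stem, k = stem[:-len(suf)], k + len(suf)
            continue
        return None              # digram testing would start here


k = suffix_break("harnesses")
print("harnesses"[:-k] + "-", "harnesses"[-k:])
```

Asking for more than 4 trailing characters of 'carelessness' strips 'ness' and then 'less', giving 'care-' and 'lessness', while 'bless' is rejected because no vowel would remain before the break.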


The second phase is digram testing. Here we find the in-
teresting phenomenon that most letter-pairs are considered
hyphenatable whereas most pairs of letters that actually ap-
pear within English text are not. For example, every digram
of the form consonant-vowel is non-separable unless the
consonant is 'x'. Also every digram of the form vowel-
consonant is non-separable unless the consonant is 'q'. But
these pairs so predominate in English that it is not hard to
find words in which no breakable digram appears; 'hyphenate'
itself is one such word.


Finally, we insist on at least one vowel before and after the
break. This is so that we do not hyphenate words like 'bless'
which only appear to have a hyphenating suffix, or words like
'returns' which would otherwise be hyphenated 'retur-ns'. Also
we do not hyphenate words with strange characters in them
other than certain leading and trailing punctuation and an
initial capital. Otherwise, paragraphs like this and the last
2 might prove awkward to decipher.


*  HYPHENATE(RWORD,MIN) will indicate where within the rever-
*  sed word (RWORD) a hyphenation point can be found.  MIN
*  indicates the number of characters by which the word must
*  be diminished in order that the line may include this
*  word.  A global variable, HYPHEN, will be set to '-' if a
*  hyphen must be added to the word.  HYPHENATE will fail if
*  no hyphenation point is found.  As an example, HYPHENATE(
*  'niatbo',3) will just succeed and return a value of 4.
*  HYPHEN will be set to '-'.  The 2nd argument may be < 0 in
*  which case the first nontrivial hyphenation will be found.
*
         DEFINE('HYPHENATE(RWORD,MIN)K,C,L')
*
*  Initialize suffix matching patterns.  Construct 3 patterns
*  INHIB_SUFF, NEUT_SUFF, and HYPH_SUFF corresponding to the
*  3 types of suffixes mentioned in the text.  They will be
*  applied to a reversed version of the word to be
*  hyphenated.
*
         INHIB_SUFF  =  OR(UPLO(BALREV('ED,(GLSV)E,(GQ)UE,ING,EST,')))
         NEUT_SUFF  =  OR(UPLO(BALREV('(AI)BLE,LY,S,ES,')))
+                      |  ANY('.;,:?)')
         HYPH_SUFF  =  OR(UPLO(BALREV(
+           'TURE,(CGST)IVE,(CDMNT)IAL,FUL,(CGST)IAN,'
+           '(CGST)ION,SHIP,(LN)ESS,(CGST)IOUS,(CDGLMNTV)ENT,')))


*
*  DIGRAMS is a string representing all letter pairs which
*  are regarded as breakable.  Thus 'xa' is a breakable pair.
*  '@' stands for the set of vowels (aeiou) and '~' stands
*  for complementation.  Hence '~(@)B' means that all
*  consonants followed by a 'b' are breakable; also '~(@NS)C'
*  means that any vowel, 's' or 'n', when followed by a 'c'
*  is NOT breakable.
*
         DIGRAMS  =
+  'XA,~(@)B,~(@NS)C,~(@R)D,XE,~(@)F,~(@N)G,~(CGPSTW)H,XI,'
+  '~(@)J,~(@CLNS)K,~(@BCFGPTY)L,~(@Y)M,~(GKSY)N,(AX)O,'
+  '~(SY)P,~(S)Q,(JKLMNRSVXZ)R,~(@KLNWY)S,~(@FHSY)T,XU,'
+  '~(@)V,~(@S)W,~(@)X,(QUXY)Y,~(@C)Z'


*
*  Convert @ to vowels, and find complement if ~ is present.
*
HYPH_D1  DIGRAMS  '@'  =  'AEIOU'                   :S(HYPH_D1)
HYPH_D2  DIGRAMS  '~'  BAL . T  =  '('  DIFF(UPPERS_,T)  ')'
+                                                   :S(HYPH_D2)
*
*  Convert to lower case and reverse to make scanning easier.
*  Then prepare a table (DIGRAM_TBL) of all those breakable
*  digrams.
*
         DIGRAMS  =  BALREV(UPLO( DIGRAMS ))
         DIGRAM_TBL  =  TABLE(30)
HYPH_D3  DIGRAMS  LEN(1) . C
+           ('('  BREAK(')') . CC  ')'  |  LEN(1) . CC)
+           (','  |  RPOS(0))  =                    :F(HYPH_D4)
         DIGRAM_TBL<C>  =  ANY(CC)                  :(HYPH_D3)
HYPH_D4


*
*  HYPH_PAT is the chief hyphenating pattern combining all
*  previous patterns into one.  It will look for a break at
*  least MIN spaces from the back of the string and will set
*  K to equal the break point.
*
         HYPH_PAT  =  HYPH_SUFF  @K  (*GT(K,MIN) | FENCE  *HYPH_PAT)
+           |  NEUT_SUFF  FENCE  *HYPH_PAT
+           |  (INHIB_SUFF | NULL)  FENCE  ARB  LEN(1) $ C  @K
+              *GT(K,MIN)  *DIGRAM_TBL<C>
*
*  Other miscellaneous patterns follow.
*
         TRUE_WORD  =  POS(0)  (ANY('.;),:?') | NULL)
+           SPAN(LOWERS_ '-')  (ANY(UPPERS_ '(') | NULL)  RPOS(0)
         FIRST_VOWEL  =  BREAK(UPLO( 'AEIOU' ))  LEN(1)  @L
         FOLLOWING_VOWEL  =  POS(0)  TAB(*K)  BREAK(UPLO('AEIOUY'))
+                                                   :(HYPHENATE_END)
*
*  Entry point:  Check to see if a normal word is there.  Set
*  MIN to be at least beyond the first vowel.
*
HYPHENATE
         RWORD  TRUE_WORD                           :F(FRETURN)
         RWORD  '-'                                 :S(HYPH_1)
         RWORD  FIRST_VOWEL                         :F(FRETURN)
         MIN  =  LT(MIN,L)  L
*
*  Scan for a hyphenation point; check for following vowels.
*  Insist on more than one character preceding the hyphena-
*  tion point.
*
         RWORD  HYPH_PAT                            :F(FRETURN)
         RWORD  FOLLOWING_VOWEL                     :F(FRETURN)
         LE(SIZE(RWORD) - K,1)                      :S(FRETURN)
*
*  Return K and set HYPHEN to a '-'.
*
         HYPHENATE  =  K
         HYPHEN  =  '-'                             :(RETURN)
*
*  If the word already contains a hyphen, this is the only
*  point at which it may be hyphenated.
*
HYPH_1   HYPHEN  =
         RWORD  '-'  @K  *GT(K,MIN)                 :F(FRETURN)
         HYPHENATE  =  K - 1                        :(RETURN)
HYPHENATE_END
Names referenced     Name         Type        Where defined
by HYPHENATE:        BALREV *     Function    Program 3.8
                     OR *         Function    Program 8.9
                     UPLO *       Function    Program 2.1
                     DIFF *       Function    Program 3.10
                     UPPERS_ *    String      Program 2.1


* indicates name is referenced in the initialization section.


Epilogue 


The coding of HYPHENATE was based on the desire to make it 
easy to see and modify the suffixes and letter pairs on which 
the algorithm is built, but at the same time to produce an ef- 
ficient subroutine. The suffixes and digrams have therefore 
been transformed by the initialization section from a viewable 
format to a swiftly runnable one. The result of the pre-
computing is a single pattern (HYPH_PAT) used to scan the word
in reverse. If a hyphenation point is found, the variable K is
set; if none is found, the pattern fails.
Suffix testing and removal are done by essentially
OR'ing the various suffixes together with an appropriate 
degree of sophistication as contributed by the function OR 
(Program 8.9). OR contributes to efficiency by consolidating 
strings beginning with the same first character. 
Digrams are done a little differently. One could have taken
the OR of all breakable digrams to produce a pattern of the
form


     'a' ANY(...) | 'b' ANY(...) | 'c' ANY(...) | ...


This would require 26 tests for each character within the word
to be hyphenated until a break point was found. A more direct
approach is a variant on the pattern


     LEN(1) $ C  *DIGRAM_TBL<C>


where the search through 26 alternates is replaced by the
lookup in the table. Since the look-up is done by hash coding
it can be, and is, accomplished faster than ORing.
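
The table approach can be illustrated in Python, where a dictionary plays the part of DIGRAM_TBL. This is a sketch under assumptions, not the initialization code of HYPHENATE: only the three entries explained in the program's comments are used ('XA', '~(@)B' and '~(@NS)C'), and the names `build_table` and `breakable` are invented here.

```python
VOWELS = "aeiou"
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def build_table(spec):
    """Map each second letter to the set of first letters it may follow."""
    table = {}
    for item in spec.lower().split(","):
        negate = item.startswith("~")
        if negate:
            item = item[1:]
        second = item[-1]
        firsts = item[:-1].strip("()").replace("@", VOWELS)
        if negate:               # '~' means: every letter NOT listed
            firsts = "".join(c for c in LETTERS if c not in firsts)
        table[second] = set(firsts)
    return table


TABLE = build_table("XA,~(@)B,~(@NS)C")


def breakable(first, second):
    """One hashed lookup on the second letter, then a membership test."""
    return first in TABLE.get(second, ())


print(breakable("m", "b"), breakable("a", "b"), breakable("n", "c"))
```

The lookup is keyed on the second letter of the pair because the word is scanned in reverse, so the second letter is the one met first.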


But it is interesting to note that it is not a great deal 
faster. Evaluating an unevaluated expression requires suf- 
ficient time that the tradeoff in speed occurs at about 10 
alternands. If the pattern were intelligent enough not to take 
alternatives after once finding a character it would avoid 
some needless testing and the average number of trials would 
be 13, not 26. Moreover, if the sequence of characters is ar- 
ranged in order of the frequency of their appearance in 
English, we may expect to wait on the average of perhaps only 
6 alternands. This suggests a pattern of the form 


     'e' FENCE ANY(...) | 't' FENCE ANY(...) | ...


This pattern is slightly more awkward to use since it will
succeed or fail at the first character position. It must be
moved against the subject string by explicit programmer com-
mands. Since the speedup of this approach cannot be great (if
even positive) we leave its encoding as an exercise.


Program 10.8   IMAGE

Printing a line which contains backspace characters is not
easy using a standard line printer. In fact, it is not
immediately clear how we can even package this activity. We
certainly would like to focus all print line extraction into a
single function. But what is this function to return?


If the function were to go ahead and print the line, complete 
with overstrikes, we would not have a very flexible function. 
Since we have no idea of the use that is to be made of the 
line it would be rather poor practice to commit ourselves in 
advance to any particular disposition. We could return a 
linked list of lines, one for each overstrike or a string of 
consecutive lines (assuming we know the line width these could 
be later separated) but these 2 methods imply the necessity of 
disentangling the strings once they were brought back, a 
process easily enough done but just as soon avoided if 
possible. Rather than return all the lines at once we will
have IMAGE return just one particular line, the line numbered
I. This will help us in 2 ways. Not only will it be easier
to use in the normal case, but it will provide us with random
access to certain levels of lines. If, for example, we
interpret the 3rd overstrike as actually a superscript, we could
print that line first before going on to the others.


IMAGE(S,I) will return the Ith overstruck image of the
B-normalized string S: for I=1 the line proper is returned, for
I=2 the set of first overstrikes, for I=3 the set of 2nd
overstrikes, etc. For I=0 the underscoring of sections set off
by USCORE's is returned. If IMAGE(S,I) does not exist for some
I, the function will fail. Note that for I=1 the function never
fails.


For example, let 


     S = 'THE ' USCORE 'QUICK BRO' BSPACE '/WN' USCORE ' FO' BSPACE '/X'
then
     IMAGE(S,0) = '    ___________    '
     IMAGE(S,1) = 'THE QUICK BROWN FOX'
     IMAGE(S,2) = '            /    / '
     IMAGE(S,3) fails
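
The behaviour just described can be modelled outside SNOBOL4.
The following is a minimal sketch in Python, not the book's
program: it assumes every ordinary character has a spacing of 1,
that USCORE occupies no print position, and that the input is
B-normalized. The character chosen for USCORE and the name image
are hypothetical; None stands for failure of the function.

```python
BSPACE = "\b"     # backspace character
USCORE = "\x1f"   # hypothetical stand-in for the underscore-toggle character


def image(s, i):
    """Return the i-th print line of the B-normalized string s,
    or None where IMAGE(S,I) would fail."""
    columns = []      # columns[k] lists the characters struck at print position k
    underlined = []   # underlined[k] is True if position k lies between USCOREs
    under = False
    k = 0
    while k < len(s):
        ch = s[k]
        if ch == USCORE:
            under = not under
        elif ch == BSPACE:
            # B-normalized input: each BSPACE follows a printed character
            # and is followed by the character that overstrikes it.
            columns[-1].append(s[k + 1])
            k += 1
        else:
            columns.append([ch])
            underlined.append(under)
        k += 1
    if i <= 0:
        if USCORE not in s:
            return None
        return "".join("_" if u else " " for u in underlined)
    if i > 1 and not any(len(col) >= i for col in columns):
        return None
    return "".join(col[i - 1] if len(col) >= i else " " for col in columns)


s = "THE " + USCORE + "QUICK BRO" + BSPACE + "/WN" + USCORE + " FO" + BSPACE + "/X"
for level in (1, 2, 0):
    print(repr(image(s, level)))
```

The sketch builds all levels at once and then selects one, which
is simpler to follow than scanning the string separately for each
level, though it does more work when only one level is wanted.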

Printing a line reduces to the following program. First we
associate OVER with a format which ensures overstriking.
(PRINTER is a variable designating the printer unit; it is
installation dependent and must be given by the user.) The
width of the printer is assumed to be 132.


     OUTPUT(.OVER, PRINTER, '(1H+,132A1)')


          OUTPUT = IMAGE(LINE,1)
          I = 1
LOOP      I = I + 1
          OVER = IMAGE(LINE,I)                   :S(LOOP)
          OVER = IMAGE(LINE,0)


Note that nothing is printed in a statement in which IMAGE
fails.


Even this activity, however simple and straightforward, could be
avoided if we had the ability to return a data object having
more dimensions than the singly dimensioned string. Such data
objects exist; for example, an extended version of SNOBOL4,
called SNOBOL4B [Gimpel 1972], has a 3-dimensional aggregate
of characters as a special datatype (called a block). The
system which produced this text was written in SNOBOL4B. In
this system not only does a function return an overstruck line
as a value but there exists a function called TYPSET which
returns an entire paragraph complete with overstriking.


+------------------------------------------------------------+
| IMAGE(S,I) will return the Ith print line associated with  |
| the string S. It will fail if there is no Ith line. S is   |
| assumed to be B-normalized.                                |
+------------------------------------------------------------+

              DEFINE('IMAGE(S,I)C,BU,T,T1')
              IF_OVERSTRIKE  =  BREAK(BSPACE USCORE)
              IF_BSPACE  =  BREAK(BSPACE)
              IF_USCORE  =  BREAK(USCORE)          :(IMAGE_END)

+------------------------------------------------------------+
| Entry point: Fan out to various locations depending on     |
| value of I.                                                |
+------------------------------------------------------------+

IMAGE         LE(I,0)                              :S(IMAGE_USCORE)
              GT(I,1)                              :S(IMAGE_BSPACE)

+------------------------------------------------------------+
| I = 1: Ignore USCORE's, BSPACE's and characters following  |
| BSPACE's.                                                  |
+------------------------------------------------------------+

              IMAGE  =  S
              IMAGE  IF_OVERSTRIKE                 :F(RETURN)
IMAGE_1       IMAGE  BREAK(BSPACE USCORE) . T
+                  (USCORE | LEN(2))  =  T         :S(IMAGE_1)F(RETURN)

+------------------------------------------------------------+
| For line 0 come here. Make fast scan for USCORE failing    |
| if none exists. BU will be a convenient abbreviation for   |
| BREAK(USCORE). Replace all up to the first USCORE by       |
| blank. Replace material between USCORE's by '_'s.          |
+------------------------------------------------------------+

IMAGE_USCORE
              S  IF_USCORE                         :F(FRETURN)
              BU  =  BREAK(USCORE)
IMAGE_UL
              S  BU . T  USCORE  (BU . T1  USCORE | REM . T1)  =
              IMAGE  =  IMAGE  DUPL(' ',SPACING(T))
+                  DUPL('_',SPACING(T1))
              S  BU                                :S(IMAGE_UL)
              IMAGE  =  IMAGE  DUPL(' ',SPACING(S))   :(RETURN)

+------------------------------------------------------------+
| For I > 1 come here. Set up pattern PAT.C specially        |
| computed for level I.                                      |
+------------------------------------------------------------+

IMAGE_BSPACE  S  IF_BSPACE                         :F(FRETURN)
              PAT.C  =  BSPACE  LEN(1) . C
IMAGE_B1      I  =  I - 1  GT(I,2)                 :F(IMAGE_B2)
              PAT.C  =  BSPACE  LEN(1)  PAT.C      :(IMAGE_B1)

+------------------------------------------------------------+
| See if an Ith overstruck character exists. Set it to C if  |
| it does.                                                   |
+------------------------------------------------------------+

IMAGE_B2      S  POS(0)  BREAKX(BSPACE) . T  PAT.C  =
+                                                  :F(IMAGE_B3)
              IMAGE  =  IMAGE  DUPL(' ',SPACING(T) - 1)  C

+------------------------------------------------------------+
| Now remove any remaining BSPACE's. If the right neighbor   |
| does not exist we are free to return.                      |
+------------------------------------------------------------+

              S  POS(0)  ARBNO(BSPACE LEN(1))  NOTANY(BSPACE) . C  =  C
+                                                  :S(IMAGE_B2)F(RETURN)

+------------------------------------------------------------+
| The clue to whether any characters at level I exist is     |
| found in IMAGE. If it is still null no Ith level           |
| characters have been found.                                |
+------------------------------------------------------------+

IMAGE_B3      IDENT(IMAGE,NULL)                    :S(FRETURN)
              IMAGE  =  IMAGE  DUPL(' ',SPACING(S))   :(RETURN)
IMAGE_END

Names referenced    Name        Type         Where defined
by IMAGE:           BSPACE *    Character
                    USCORE *    Character
                    SPACING     Function     Program 10.5
                    BREAKX      Function     Program 8.2

* indicates name is referenced in the initialization section.

Exercise 10.1  Modify BNORM so that it fails if a B-normalized
version of the string does not exist.


Exercise 10.2  Prove that if S1 and S2 are B-normalized then
the concatenation S1 S2 is B-normalized.


Exercise 10.3  The text says that in order to have an inversion
in the print position numbers we must have at least one double
BSPACE. Intuitively this is obvious. Can you prove it?


Exercise 10.4  Prove that step (ii) of the BNORM algorithm
(Prog. 10.1) preserves the property of being right-balanced.


Exercise 10.5  Suppose string S1 prints the image I1 and
string S2 prints the image I2. Write a pattern-matching
statement to determine whether the image I2 is a subimage
of I1.


Exercise 10.6  Modify INORM to process separately the case of
a single overstrike.


Exercise 10.7  Rewrite PR_POS (in INORM, Prog. 10.2) to use
BREAK rather than ARB to find a BSPACE. Assume the string to be
matched is B-normalized.


Exercise 10.8  (a) How would the definition of distinguishable
change if overstrikes of the same character are not regarded
as different?

(b) How would the definition change if all nonprintable
characters were regarded as blank? Assume the nonprintables
including blank are contained in the string NONP. Also do not
make the assumption in (a).

(c) How would INORM be modified in each instance?


Exercise 10.9  (a) Modify LINE so that the cost (UF) of
compressing a line is two per char, while the cost of adding a
blank and of hyphenating remains at 1 (requires modifying one
statement). (b) Modify LINE so that the cost (per char) of
compressing a line is UF_C, the cost of padding is UF_P and the
cost of hyphenating is UF_H.


Exercise 10.10  Modify PAD (Prog. 10.4) and MINP (Prog. 10.6)
so that any blank following a special character can be squeezed
out. An example of a set of special characters is ',)s:(;'.


Exercise 10.11  What is the value of HYPHENATE(RWORD,K) for
K = 2, 4, 6, 8 where

     (a)  RWORD = REVERSE('investment')

     (b)  RWORD = REVERSE('co-operation')


Exercise 10.12  Modify HYPHENATE so that it will use not only
'-' as a break character but any of a set of characters in the
string BRC. Slash (/), for example, might be such a character,
to be broken in phrases such as 'input/output'.


Exercise 10.13  Modify the hyphenation algorithm so that
digrams are tested in the order of the frequency of letters in
English ('etoanirshdlcwumfygpbvkxqjz') and such that testing at
a particular position ceases when the letter is found.


Exercise 10.14  Modify HYPHENATE so that any word consisting
entirely of upper case letters will also be hyphenated.


Exercise 10.15  (a) Write a function PRIMAGE(S) which will
print the image of the B-normalized string S. (b) Given 2
strings, S1 and S2, use PRIMAGE to print them on the same line
with S1 beginning in column 10 and S2 beginning in column 60
(assume the spacing of S1 is less than 50).


Exercise 10.16  Using PRIMAGE() of the above exercise, print
the B-normalized strings S1 and S2 on the same line. That is,
overstrike one on the other.


Exercise 10.17  Playboy magazine, for reasons best known to
itself, wishes the lead page of the Playboy pictorial to be
laid out in a 'coke bottle' shape. Assume the line widths,
ranging from a maximum of 36 to a minimum of 22, are contained
in a string (LENGTHS) separated by commas. Assume the lead
paragraph is in a variable P. Assume a page width of 60 with
the column centered in the page. Using the function PRIMAGE
from Exercise 10.15, write the SNOBOL4 program to satisfy
Playboy's request.


Exercise 10.18  Suppose that the 3rd overstrike represents
superscripting and the 2nd overstrike represents subscripting
so that

tA -1 = 2 © eN! 


prints as 


Using IMAGE, print such an object. 


Exercise 10.19  Print a string with exponentiation such as

     'A**(M+1) = B**N + C**M'

in such a way that parentheses (if any) are stripped from the
exponential and the exponents are superscripted such as

      M+1    N    M
     A    = B  + C

Assume that the string contains no BSPACE's and whenever '**'
appears it means superscript the following character unless a
'(' appears, in which case the parenthetical expression is
superscripted. Assume that the superscript does not itself
have superscripting. (Hint: this can be done in four
statements using IMAGE and BNORM).


Exercise 10.20  Extend the previous exercise to handle
arbitrarily nested exponentiation.


One of the reasons for writing in a higher level language is to
free oneself from the entanglements of individual bits and the
sometimes sordid details of the particular machine on which one
is running. A price is normally paid for this in terms of time
and/or space efficiency of the resulting program but one is
presumably willing to pay this price if the savings in
programming time are compensative. Then why, the reader may
ask, should we bother about timing and implementation since the
former we have agreed is relatively unimportant and the latter
represents detail from which we wish to escape? The answer is
that although most programs are small and can (and should) be
written without regard for the time they consume, most large
programs come to grips with the efficiency question sooner or
later. Large programs may exceed critical storage bounds or
they may consume so much time that their utility is in
question. Some knowledge of timing is useful not only to
improve the speed of an existing program but to estimate the
cost of running programs not yet written. It may well be that a
program written in SNOBOL4 will be too slow or inefficient for
a given application and it will be helpful to learn this before
it is written.


Describing a system as large as an implementation of the
SNOBOL4 language can be neither easy nor quick. To make matters
even more difficult there are several SNOBOL4 processors.
There is the original MAcro Implementation of SNOBOL4
[Griswold 1972] which we refer to as MAINBOL, there is a
compiler version for the IBM 360/370 called SPITBOL [Dewar 1971]
and a small fast interpreter for the PDP-10 called SITBOL
[Gimpel 1972, 1973a]. In addition, the macros of MAINBOL have
been expanded to run on several different machines including
the IBM 360/370, CDC 6000, Honeywell 635, Univac 1108 and the
PDP-10. The process of macro expansion for yet newer machines
continues at this writing with unabated fervor so that this
list is not, and is not intended to be, exhaustive.


The primary purpose behind SPITBOL was speed and the resulting
system is 7-8 times faster than MAINBOL. SITBOL's chief
concern was storage and the system is less than one-third the
size of MAINBOL. In spite of the differences in design goals,
the implementations of these systems are fairly similar.


Symbol Tables

A symbol table is programmer jargon for a table of information
that can be referenced on a name basis (the symbol). For
example, a telephone directory can be regarded as a symbol
table of sorts where the symbol is a person's name and the
information to be looked up is his telephone number (and
possibly other information such as his address). In principle,
a symbol table could be implemented as a long list and a search
could be made by comparing a given symbol with every one on the
list. This is obviously too inefficient to be practical. In
the telephone directory, the symbols are arranged
alphabetically to permit rapid searching. In general, a symbol
table is organized in such a way as to avoid a lengthy linear
search.


A common method of implementing a symbol table is by means of
a hashing technique, illustrated in Figure 11.1. The Hash
Array is a fixed-length array of pointers to symbol table
entries. Each symbol table entry contains the name of the
symbol (for comparison purposes), information associated with
the symbol and a pointer to the next symbol table entry (if
any). Hence, each pointer in the Hash Array may be regarded
as heading a list of symbol table entries.


When a symbol such as ALPHA is looked up or entered into the
table, a so-called hash number is computed from the characters
'ALPHA'; this is a number between 0 and L-1 where L is the
length of the Hash Array. This hash number is used to
reference into the Hash Array and hence it designates a list
of symbol table entries. If a symbol table entry for ALPHA is
in the table, it must be in this list. Thus the time to locate
ALPHA in the table is reduced to about 1/L of the time for a
linear search but is increased by the time needed to compute a
hash number.


The hash number must be reproducible so that given the charac- 
ters 'ALPHA' the same hash number is always produced, but the 
method for computing the hash is otherwise arbitrary as its 
name would suggest. It should provide a good mix so that all 
locations in the Hash Array (sometimes called buckets) are 
referenced with approximately equal probability. Also the 
computation should be quick. For example, one may take the 
first 4 characters exclusive-OR'ed with the last 4 characters 
and divide by the length L of the array. The remainder is 
usually an acceptable hash number. Note that the hash number 
does not uniquely represent the symbol. In Figure 11.1 both 
ALPHA and GAMMA have the same hash number. 
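
The scheme of the last few paragraphs can be sketched in a few
lines. The sketch below is in Python and is not taken from any
of the SNOBOL4 processors; the names hash_number, Entry and
SymbolTable are hypothetical, and padding short symbols with
blanks is an assumption of ours. The hash follows the example in
the text: the first 4 characters exclusive-OR'ed with the last
4, with the remainder after division by L taken as the result.

```python
def hash_number(symbol, length):
    """Reproducible hash in the range 0 .. length-1."""
    data = symbol.encode("ascii")
    first = int.from_bytes(data[:4].ljust(4, b" "), "big")
    last = int.from_bytes(data[-4:].rjust(4, b" "), "big")
    return (first ^ last) % length


class Entry:
    def __init__(self, name, next_entry):
        self.name = name          # kept for comparison purposes
        self.info = None          # information associated with the symbol
        self.next = next_entry    # next entry on the same bucket list


class SymbolTable:
    def __init__(self, length=8):
        self.hash_array = [None] * length   # one list head per bucket

    def find(self, name, install=False):
        """Return the entry for name; optionally install it if absent."""
        bucket = hash_number(name, len(self.hash_array))
        entry = self.hash_array[bucket]
        while entry is not None:            # search only this bucket's list
            if entry.name == name:
                return entry
            entry = entry.next
        if not install:
            return None
        entry = Entry(name, self.hash_array[bucket])   # push on front of list
        self.hash_array[bucket] = entry
        return entry


table = SymbolTable()
for name in ("ALPHA", "BETA", "GAMMA"):
    table.find(name, install=True)
table.find("ALPHA").info = 17
```

Because two symbols may share a hash number, the name stored in
each entry is still compared during the search of the list.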


Symbol tables are very important; they form the heart of vir- 
tually every assembler, compiler and interpreter. A symbol 
table provides the link between an external name (symbol) and 
an internal block of information about that symbol. One need 
merely reflect on the telephone directory example to see the 
importance of this. Names in a program remain fairly stable 
even though they may translate into different internal ad- 
dresses from run-to-run just as people normally retain their 
names even though they may be associated with different 
telephone numbers over the course of their lifetime. 


For SNOBOL4 implementations, the information typically
retained in the symbol table entry for, say, ALPHA is the
value of the natural variable ALPHA, a pointer to function
information if ALPHA is a function and a pointer to an internal
code location if ALPHA is a label. Also, if ALPHA is a keyword
(it is not) information may be present to indicate its value.


                        Figure 11.1

     A symbol table containing three symbols ALPHA,
     BETA, and GAMMA.


For interpreters with the power of SNOBOL4, the symbol table
is especially important; it remains in core during execution
and there are language features which depend on this. For
example, indirect referencing, such as:

     A = 'ABC'
     $A = 17

requires that 'ABC' be looked up in the table so that the
symbol table entry associated with 'ABC' (also called a variable
block) can be plugged. The indirect goto is another example
of where the symbol table is queried at run-time. As another
example:

     OPSYN('ALPHA','SIZE')


results in a copy of the function field of the variable block 
for SIZE into the function field of ALPHA. Conventional 
languages such as PL/I and Fortran do not retain a symbol 
table at run-time and hence cannot provide these capabilities. 
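
To make the dependence on a run-time symbol table concrete, here
is a minimal sketch in Python. It is a model only: the names
VariableBlock, block, assign_indirect and opsyn are hypothetical,
a Python dictionary stands in for the hashed symbol table, and
Python's len stands in for the built-in function SIZE.

```python
class VariableBlock:
    def __init__(self, name):
        self.name = name
        self.value = ""        # value of the natural variable (null string)
        self.function = None   # function field, if the name is a function


symbols = {}   # the run-time symbol table: name -> variable block


def block(name):
    """Look the name up, creating its variable block on first use."""
    return symbols.setdefault(name, VariableBlock(name))


def assign_indirect(name_holder, value):
    """Model of  $A = value : the value of A names the target."""
    target = block(block(name_holder).value)   # run-time lookup of 'ABC'
    target.value = value


def opsyn(new, old):
    """Model of OPSYN(new, old): copy the function field."""
    block(new).function = block(old).function


block("A").value = "ABC"
assign_indirect("A", 17)         # plugs the variable block for ABC
block("SIZE").function = len     # stand-in for the built-in SIZE
opsyn("ALPHA", "SIZE")
print(block("ABC").value, block("ALPHA").function("hello"))
```

Neither operation could be carried out if the names had been
discarded once the program had been translated.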


Whereas each of the SNOBOL4 processors retains a symbol table
to house symbols required for an associative lookup, MAINBOL
uses the symbol table for yet another purpose, viz. to store
strings. All data strings are stored as symbol table entries.
A certain economy of concept is thereby achieved at the
expense of significant inefficiencies in string handling. For
example, TRIM(INPUT) in MAINBOL will read a record, hash it
into the symbol table and call TRIM which deletes trailing
blanks and hashes the remainder into the symbol table. All
such hashing is avoided in other processors.


While interpreters generally retain the symbol table, com- 
pilers generally do not. Since it requires a volitional act 
for an interpreter to expel the symbol table and a volitional 
act for a compiler to produce it along with working code, the 
correlation seems to be the result of inertia rather than 
reflecting any essential relationship. In fact, exceptions do 
occur. Some compilers produce a symbol table optionally for 
debugging while some interpreters optionally expel the symbol 
table for efficiency. 


Types of Compilers

A compiler, in the most general sense of the term, will
translate a program written in some language into some
intermediate form which can be executed or interpreted by some
other program. If the intermediate form can be executed
directly, the processor is called a compiler, in the narrow
sense of the term. Otherwise it is called an interpreter.


One of the most important questions that can be asked about an
implementation is the form of intermediate code. Into what
form, for example, will

     ALPHA * BETA + GAMMA

be compiled? Different implementations of the same language
may answer this question in different ways. The layman often
believes that all SNOBOL interpreters leave the string intact
to be interpreted anew each time the expression is evaluated.
This is a kind of interpretation called pure interpretation
and since the compiler has zero work to do, we will call the
compiler a type-0 compiler. Some languages are implemented as
pure interpreters (such as GPM, Program 18.8) but SNOBOL4 is
not one of them.


A type-1 compiler will convert indivisible syntactic units
(called tokens) into pointers into the symbol table. For
example, the expression above will be converted into

     +-------------+
     | --> ALPHA   |
     +-------------+
     | --> *(2)    |
     +-------------+
     | --> BETA    |
     +-------------+
     | --> +(2)    |
     +-------------+
     | --> GAMMA   |
     +-------------+

where --> ALPHA is a pointer to the symbol table entry for
ALPHA, where --> *(2) is a pointer to the symbol table entry
for binary *, etc. LISP [McCarthy 1960] is an example of a
language which employs a type-1 compiler.

The searching for, and the conversion of, tokens into symbol
table pointers is called lexical analysis. Most compilers more
sophisticated than type-1 nevertheless precede other
processing with a lexical analysis.
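
A type-1 compilation of a simple expression can be sketched as
follows. The sketch is in Python and is illustrative only: the
names TOKEN, pointer and lex are hypothetical, a dictionary
stands in for the symbol table, and every operator is assumed to
be binary, which a real lexical analyzer could not assume.

```python
import re

# An identifier, or one of four single-character operators.
TOKEN = re.compile(r"\s*([A-Za-z][A-Za-z0-9._]*|[+\-*/])")

symbol_table = {}   # name -> entry; an entry here is just a small dict


def pointer(name):
    """Return the (unique) symbol table entry for name."""
    return symbol_table.setdefault(name, {"name": name})


def lex(source):
    """Type-1 compilation: replace each token by a symbol table pointer."""
    code, pos = [], 0
    source = source.rstrip()
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            raise SyntaxError("unrecognized text at position %d" % pos)
        token = match.group(1)
        # Binary operators are entered under names such as "*(2)".
        name = token + "(2)" if token in "+-*/" else token
        code.append(pointer(name))
        pos = match.end()
    return code


code = lex("ALPHA * BETA + GAMMA")
print([entry["name"] for entry in code])
```

Each element of the result is the entry itself, not a copy, so
two occurrences of the same name yield the same pointer.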


A type-2 compiler will rearrange the pointers into a form more 
suitable for execution. This can either be a Polish prefix 
representation in which the functions precede the arguments or 
a Polish suffix representation in which the function pointers 
follow the arguments. Each form is illustrated in Figure 11.2. 


               (a)                       (b)
          +-------------+          +-------------+
          | --> +(2)    |          | --> ALPHA   |
          | --> *(2)    |          | --> BETA    |
          | --> ALPHA   |          | --> *(2)    |
          | --> BETA    |          | --> GAMMA   |
          | --> GAMMA   |          | --> +(2)    |
          +-------------+          +-------------+

                        Figure 11.2

     The result of a type-2 compilation of the expression
     ALPHA * BETA + GAMMA may be (a) Polish prefix
     or (b) Polish suffix.


Most interpreters operate on type-2 code. In particular,
MAINBOL uses Polish prefix and SITBOL uses Polish suffix.
Polish prefix is slower but more flexible than Polish suffix.
It is slower because with prefix code the function is
encountered first. When the function gets control it calls the
interpreter to obtain its arguments. This call is necessarily
recursive and hence slow. In Polish suffix the function is
called after the arguments have been evaluated; there is no
need for recursion. But Polish prefix is more flexible because
certain operators can decide that they do not want to play the
same game as other operators. Unary *, for example, does not
evaluate its argument but merely returns a pointer to it to be
evaluated at some later time. In Polish suffix, unary * can't
decide this on its own but needs the co-operation of the
compiler. This leads to other problems. For example, unary *
cannot be redefined at run-time.
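
The difference between the two forms shows up clearly in the
shape of the evaluators. The following Python sketch is ours and
is not taken from MAINBOL or SITBOL; names stand in for symbol
table pointers, and the function names and sample values are
assumptions. The prefix evaluator must call itself to obtain
arguments, while the suffix evaluator needs only a stack.

```python
import operator

FUNCTIONS = {"+(2)": operator.add, "*(2)": operator.mul}


def eval_prefix(code, values, pos=0):
    """Evaluate Polish prefix code starting at pos.
    Returns (result, position just past the expression)."""
    item = code[pos]
    if item in FUNCTIONS:
        # The function is met first and must call the evaluator
        # recursively to obtain each of its arguments.
        left, pos = eval_prefix(code, values, pos + 1)
        right, pos = eval_prefix(code, values, pos)
        return FUNCTIONS[item](left, right), pos
    return values[item], pos + 1


def eval_suffix(code, values):
    """Evaluate Polish suffix code with an explicit stack; no recursion."""
    stack = []
    for item in code:
        if item in FUNCTIONS:
            right = stack.pop()
            left = stack.pop()
            stack.append(FUNCTIONS[item](left, right))
        else:
            stack.append(values[item])
    return stack.pop()


values = {"ALPHA": 2, "BETA": 3, "GAMMA": 4}
prefix = ["+(2)", "*(2)", "ALPHA", "BETA", "GAMMA"]
suffix = ["ALPHA", "BETA", "*(2)", "GAMMA", "+(2)"]
print(eval_prefix(prefix, values)[0], eval_suffix(suffix, values))
```

In the prefix evaluator a function could simply decline to ask
for its arguments, which is the freedom that unary * exploits;
in the suffix evaluator the arguments are already on the stack
by the time the function is reached.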


The types 0-2 compilers are regarded as interpreters because
the output (intermediate code) is not capable of being
executed directly by machine. A type-3 compiler will produce
code which can actually be executed. The above expression
becomes:

     PUSH  --> ALPHA
     PUSH  --> BETA
     CALL  --> *(2)
     PUSH  --> GAMMA
     CALL  --> +(2)


where each function finds its arguments on the stack and 
replaces them with the result of its computation. For ef- 
ficiency purposes, registers can be used instead of the stack 
except for very deeply nested expressions. 


A type-4 compiler is one which produces optimal (or
near-optimal) machine code. The above expression is reduced to:

     LOAD  --> ALPHA
     MULT  --> BETA
     ADD   --> GAMMA


Most true compilers are combinations of type-3 and type-4. For
example, Fortran I/O routines and trigonometric functions are
handled with type-3 calls whereas infix operators (+ * - /)
and some arithmetic functions such as MAX and ABS are executed
in-line in a type-4 manner. SPITBOL is almost entirely type-3.
The only operation it does in-line is assignment. The reason
that, for example, in-line addition can't be done is because
variables are typeless and the compiler has no way of knowing
whether A + B is floating point addition, fixed point or mixed
mode. Assignment, on the other hand, even for strings and
arrays, is comparatively simple since only a pointer and a
datatype need be copied.
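
The point about typeless variables can be illustrated by what an
out-of-line addition routine has to do. The sketch below is in
Python and is an assumption-laden model, not SPITBOL's routine:
the names to_number and add are hypothetical, and the conversion
rules are simplified to integers, reals and numeric strings.

```python
def to_number(value):
    """Convert a string operand to a number, as a typeless language must."""
    if isinstance(value, str):
        text = value.strip() or "0"          # the null string counts as zero
        return float(text) if "." in text else int(text)
    return value


def add(a, b):
    """Out-of-line '+': the operand types are only known at run time."""
    a, b = to_number(a), to_number(b)
    if isinstance(a, int) and isinstance(b, int):
        return a + b                          # fixed point addition
    return float(a) + float(b)                # floating point or mixed mode


print(add(2, 3), add(2, 1.5), add("4", 1))
```

Every one of these tests is made each time the addition is
executed, which is why a single in-line ADD instruction cannot
be compiled in its place.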


It should be evident that as the sophistication of the com- 
piler increases (increasing type numbers) the speed of com- 
pilation decreases, the speed of execution increases and the 
flexibility of the run-time system decreases. For example, 
the type-2 rearrangement of operators is done so that 
operators will be where they are needed when it comes time to 
execute. This is faster but less flexible since it means that 
it is practically impossible to change the precedence of 
operators at run-time in a type-2 system; an irrevocable deci- 
sion is made at compile-time. 


Floating Storage

The lack of declarations in SNOBOL4 (e.g., S is a string whose
maximum length is 1000) implies that storage is not
preallocated for variables but rather is allocated on demand.
When storage is no longer in use it is freed automatically by a
so-called garbage collection process.


In SPITBOL, SITBOL and MAINBOL the storage allocation scheme
is basically the same. Allocating storage is ultra-simple.
When a chunk of storage is needed it is taken from the
beginning of a free region and the pointer to the free region is
updated. When no free storage is left, the garbage collector
is called. The first step of collection is a marking process
in which all accessible blocks are marked as such. This is
similar in spirit to the function VISIT (Prog. 5.10) and in
SITBOL and SPITBOL it is actually implemented in the same way.
Once the accessible blocks have been identified, they are
moved together so that further allocations can be performed.
Before the movement, any pointer pointing into or to a
floating block must be adjusted. The term floating is used as
it seems to correctly connote the relative ease by which the
blocks may be moved about. The incorrect care and feeding of
floating addresses while implementing a system such as SNOBOL4
has led to many an implementation disaster. A useful rule of
thumb is that one such error will lead to a day's worth of
debugging sometime in the future.
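
The allocation scheme and the three steps of collection (mark,
adjust, move) can be sketched as follows. This is a simplified
model in Python, not the code of any of the processors: the
class Heap and its fields are hypothetical, addresses are plain
integers, and marking uses an explicit list of pending blocks
rather than the method of VISIT.

```python
class Heap:
    def __init__(self, size):
        self.size = size
        self.free = 0        # start of the free region
        self.blocks = {}     # address -> block; a block is a small dict
        self.roots = {}      # variable name -> address (or None)

    def allocate(self, length):
        """Take length words from the start of the free region."""
        if self.free + length > self.size:
            self.collect()
            if self.free + length > self.size:
                raise MemoryError("storage exhausted")
        address = self.free
        self.free += length
        self.blocks[address] = {"length": length, "pointers": []}
        return address

    def collect(self):
        # Marking: find every block accessible from the roots.
        marked = set()
        pending = [a for a in self.roots.values() if a is not None]
        while pending:
            address = pending.pop()
            if address not in marked:
                marked.add(address)
                pending.extend(self.blocks[address]["pointers"])
        # Plan the move: accessible blocks slide together, order preserved.
        new_address, next_free = {}, 0
        for address in sorted(marked):
            new_address[address] = next_free
            next_free += self.blocks[address]["length"]
        # Adjust every floating address, then move the blocks.
        self.roots = {name: None if a is None else new_address[a]
                      for name, a in self.roots.items()}
        moved = {}
        for address in marked:
            block = self.blocks[address]
            block["pointers"] = [new_address[p] for p in block["pointers"]]
            moved[new_address[address]] = block
        self.blocks, self.free = moved, next_free


heap = Heap(10)
heap.roots["T"] = heap.allocate(3)
heap.roots["A"] = heap.allocate(3)
heap.roots["B"] = heap.allocate(3)
heap.blocks[heap.roots["B"]]["pointers"].append(heap.roots["A"])  # B points to A
heap.roots["T"] = None               # T's block becomes garbage
heap.roots["A"] = None               # A's block is still reachable through B
heap.roots["C"] = heap.allocate(3)   # no room left: forces a collection
print(heap.roots, heap.free)
```

Note that an address held in an ordinary Python variable across
a call to allocate would not be adjusted by collect. This is
exactly the kind of mishandled floating address warned of above,
and it is why the example keeps every address in roots.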


It is interesting to note that the predecessor to SNOBOL4, viz.
SNOBOL3, implemented its marking phase by means of a
use-count. Every time a variable's value is changed under such a
system, the use-count on the new object would be augmented and
the use-count on the old would be decremented. Marking
consists of looking for nonzero use-counts. Where strings are
the only datatype, as in SNOBOL3, this is not a bad scheme.
If one can have structures pointing to other structures,
however, the scheme suffers from the prospect that two
structures pointing to each other may be inaccessible from the
rest of the world and yet have nonzero use-counts.
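
The difficulty with use-counts is easily demonstrated. The
sketch below is in Python and models only the counting; the
classes Structure and Variable and the function point are
hypothetical names introduced for the illustration.

```python
class Structure:
    def __init__(self, name):
        self.name = name
        self.use_count = 0
        self.field = None        # may point to another Structure


class Variable:
    def __init__(self):
        self.value = None


def point(holder, attribute, new):
    """Store a pointer, keeping use-counts up to date."""
    old = getattr(holder, attribute)
    if new is not None:
        new.use_count += 1       # augment the count on the new object
    if old is not None:
        old.use_count -= 1       # decrement the count on the old
    setattr(holder, attribute, new)


x, y = Variable(), Variable()
a, b = Structure("A"), Structure("B")
point(x, "value", a)      # X = A
point(y, "value", b)      # Y = B
point(a, "field", b)      # A points to B
point(b, "field", a)      # B points to A
point(x, "value", None)   # the program drops both variables ...
point(y, "value", None)
# ... yet neither count has returned to zero.
print(a.use_count, b.use_count)
```

Both structures are now unreachable from any variable, but a
collector that looks only for zero use-counts would never
reclaim them.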


The method of implementing the garbage collector in SPITBOL
and later copied over into SITBOL was especially clever. After
visiting nodes in the manner of the function VISIT, the
pointers are left in their reverse direction. This leads to a
fast pointer adjustment phase as all the floating addresses which
had been pointing to a floating block are then hung off the
block in a linked list. The MAINBOL processor uses a more
conventional marking phase using recursion much in the manner
of COPYL (Prog. 5.8). Also the use of macros produced a slower
system. The result is that the garbage collectors of SPITBOL
and SITBOL are much faster than that of MAINBOL.


Anatomy of a Processor

This section attempts to describe how a SNOBOL4 processor is
organized and which parts of it are exercised most frequently
during the course of executing a program. While such an
analysis is application and implementation dependent, certain
valid conclusions can nonetheless be drawn concerning the
running of arbitrary programs against such systems.


Most SNOBOL implementations tend to be implemented as one
large assembly program and it is often difficult to break down
the resource utilization into different functional
compartments. The SITBOL implementation is an exception. It
consists of 20 separately-assembled files segregated according
to function as indicated in Table 11.1. Each section is
designated with a two- or three-letter mnemonic as well as an
indication of space occupied as a percentage of the whole. The
approximate number of instructions in each section can be
computed by multiplying the percentage by the total number of
words (9300).


The 15.5% figure for I/O in Table 11.1 is surprisingly high.
It includes code to read and analyze the command string, set
up memory, provide a fairly rich collection of system
facilities and interpret special I/O formats and make suitable
conversions. The space devoted to the interpreter is padded
by calls to produce run statistics at job termination plus a
message interpreter. Hence the 7.3% figure is larger than what
would normally be considered strictly necessary for the
interpretation of Polish suffix. Also required in interpretation
is all that machinery necessary to provide the correct number
of arguments to functions, to evaluate arguments (convert
variables such as A to the value of A, or convert INPUT to
the next string read, etc.), and to interpret goto's and react
correctly to failure.


Table 11.1  The Decomposition of SITBOL. Regions are named by a
short (2 or 3 letter) mnemonic. The Size is based on the number
of words of assembled code and is given as a percentage of the
total. The overall size was 9300 (36-bit) words. The storage
considered is pure storage and does not include space for
stacks, symbol tables, code blocks, etc.

Name   Size(%)   Description
----   -------   ------------------------------------------
IO      15.5     I/O and system interface
INT      7.3     Interpreter
GC       3.7     Garbage Collector
SYN      4.1     Syntactic Analyzer
LEX      4.4     Lexical Analyzer
SYM      7.9     Symbol table manager
STR      6.1     String handler
SMR      2.1     Streaming (character set searching)
PG       5.7     Patterns Global (pattern building and
                 the scanner)
PL       7.9     Patterns Local (built-in functions
                 and primitives)
NUM      2.1     Numeric functions
CVT      4.4     Datatype conversions (string <==> numeric)
ARY      2.0     Arrays (allocation & referencing)
KW       2.0     Keywords
TBL      2.9     Tables (allocation, referencing and
                 conversion)
DFF      3.5     Defined functions
DFD      1.8     Defined Datatypes
ERR      2.0     Error handling
TRC      7.5     Tracing
DATA     7.1     Assembled in strings, character sets, etc.


The compiler consists of a lexical analyzer (LEX) which makes
calls on the symbol table manager (SYM) to convert source
tokens to pointers into the symbol table which it feeds back
to the syntactic analyzer (SYN). LEX makes calls on the
streamer (SMR) to search for one of a set of characters. Thus
the entire compiler represents 18.5% of the system with the
syntactic analyzer only 4%. This is surprising in view of the
great attention devoted to syntactic analysis in the
literature. The symbol table manager is bloated by an internal
symbol table of approximately 450 words (4.8%) and a number of
symbol table related functions such as CLEAR() and OPSYN().
The machinery for locating and installing names into
the symbol table is actually quite small.
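
The figures quoted in this paragraph can be checked against
Table 11.1 with a few lines of arithmetic. The Python below is
only a check; the variable names are ours, and the percentages
are those listed in the table for LEX, SYM, SYN and SMR.

```python
TOTAL_WORDS = 9300
size_percent = {"LEX": 4.4, "SYM": 7.9, "SYN": 4.1, "SMR": 2.1}

# The compiler is taken to be the four regions named in the text.
compiler_percent = sum(size_percent.values())
compiler_words = TOTAL_WORDS * compiler_percent / 100

# The internal symbol table of about 450 words, as a share of the whole.
internal_table_percent = 450 / TOTAL_WORDS * 100

print(round(compiler_percent, 1))        # 18.5
print(compiler_words)
print(round(internal_table_percent, 1))  # 4.8
```

The same multiplication gives the approximate number of words in
any other region of the table.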


The relatively large quantity, 7.9%, of code for PL (Patterns
Local) is attributable to the relatively large number of
built-in patterns such as POS(n), BREAK(S), BAL, etc.


The SITBOL system has a profiling capability which indicates
where the system is spending its time. One can obtain a
user-oriented histogram (via statement numbers) or a
system-oriented one (via absolute addresses). This, coupled with
the physical segregation previously described, makes it fairly
easy to determine the percentage of time devoted to each
subactivity. Table 11.2 summarizes the results of running the
profiler for 6 typical string applications. The last column
indicates a composite figure obtained rather arbitrarily by
averaging the other 6 figures.


Table 11.2  The percentage of time spent in various regions of
SITBOL for a variety of string-processing problems.
L6 is a compiler. Renum renumbers the statement labels of
Fortran programs. TPST (Typeset) is a program to format
paragraphs and uses functions virtually identical to those
indicated in Chapter 10. Pre is a pre-processor for Fortran
which inserts common areas at the beginning of subprograms and
does minor data massaging. Sort is a linked-list sort of a
kind identical to Prog. 13.3. Refm reads a file with mixed
tabs and blanks separating 4 fields and writes out the file
with columns aligned using tabs as needed. With one exception
(Sort) all programs were complete programs so that time spent
in I/O and other necessary but unrelated activity would be
included in the timing statistics. Not included, as is
evident from the data itself, is the time spent compiling.


The composite figure indicates the rather striking fact that
over one-third of the time is spent in the interpreter. Most
of this time would drop to nil if SITBOL had been a compiler.
However, a compiler version of SITBOL would almost certainly be
larger by close to the percentage of time saved so that the
cost (measured in core-seconds) would be the same. The
important point is that the interpretive time is no larger
than it is. Substantial amounts of time are going to other
things such as garbage collection (20%), string processing
(15%), pattern matching (PG, PL and SMR, 10%) and IO (7%). It
is only in applications such as Sort which use few of the
facilities of the language (no storage allocation, no pattern
matching) that the interpreter time is really excessive. Thus
semantically rich processors such as SNOBOL4 have two reasons
for being written as interpreters: the semantic richness is
easier to write, and there is not that much being lost.


Comparing individual columns it may be seen that the
pre-processor Pre spends relatively large amounts of time doing
I/O because it has virtually no work to do on most lines read.
The relatively low figure of 18% interpreter use in the
Fortran renumbering program is probably due to the heavy use of
concatenation and pattern matching, and the rest of the data
bears this out. TPST spends far more time in SMR than do the
other routines, and this is because it is continually scanning
for USCOREs and BSPACEs as was pointed out in Chapter 10. The
PDP-10 has no automatic scan instruction like the IBM 360 but
nonetheless, even in this exaggerated use of the BREAK
function, relatively little time (7%) is spent streaming. The
DFF entry indicates the amount of time spent in function calls
and is relatively small even for heavily recursive applications
such as Sort. The amount of time spent in this category has
more to do with the structuredness of the program. TPST, as
a look at Chapter 10 would reveal, is well-modularized and a
certain price must be paid, but the cost is not excessive. It
is somewhat surprising that areas such as numerics,
conversions, tables, arrays, defined datatypes, and keywords
represent so little of the total time (3.7%). Even when, for
example, the defined datatypes are used rather heavily as in
Sort, the amount of time spent in DFD is relatively small
(4.3%).


How do these figures compare with the corresponding figures
for MAINBOL and SPITBOL? Since SPITBOL is type-3, the time
spent in INT would be reduced substantially and, to a first
approximation, all other activities would experience a
proportional increase (just to make up the 100%). The garbage
collection time would be reduced somewhat because SITBOL,
operating in a time-sharing environment, deliberately keeps a
'low profile' to keep a relatively good priority. This results
in garbage collections every 1500 words or so, which is quite
frequent compared with batch-oriented systems such as SPITBOL.
The STR (String Handling) area would also be reduced in
SPITBOL because the IBM 360 is a byte-oriented machine with
certain built-in string operations. The result is that SPITBOL
should be more nearly balanced in its overall profile with
much of its time being spent in pattern matching, defined
functions, IO and garbage collection. This, however, will
depend considerably on the application. MAINBOL has an
interpretive loop about twice as slow as SITBOL's and has a
much slower garbage collection, pattern matcher and I/O. Since
overall program time goes up by more than a factor of 2, the
fraction of time spent in the interpreter for MAINBOL would
actually decrease (to, say, 25%). IO, GC, PL, PG and SMR times
would increase whereas other times would likely remain roughly
the same.


Program 11.1  RESOLUTION

To accumulate his own timing statistics, the programmer will
make calls on the built-in function TIME(). The value returned
is not uniformly increasing, but rather rises in steps which
are sometimes rather large. On many systems the step size,
called the resolution, is one-sixtieth of a second, which is
fairly large as many things can happen during this time
period. It is essential to know or be able to compute this
resolution to obtain accurate timings. Fortunately, this is
rather easily done.


              DEFINE('RESOLUTION()T')           :(RESOLUTION_END)
*
*   Entry point: Initialize T to the current time. Then
*   repeatedly set RESOLUTION to the difference between the
*   current time and this initial time. When it goes positive,
*   the smallest resolution is obtained.
*
RESOLUTION    T = TIME()
RESOLUTION_1  RESOLUTION = TIME() - T
              GT(RESOLUTION,0)          :S(RETURN)F(RESOLUTION_1)
RESOLUTION_END
Epilogue 


Since TIME() returns an integer in milliseconds, it is
possible that the resolution may be off by as much as a
millisecond. For example, on the IBM 370 Mod 165 the interval
timer resolution is 3.3 milliseconds and RESOLUTION returns 3
two-thirds of the time and 4 one-third of the time. In such
cases, RESOLUTION could be modified to return a constant known
value. But it should be remarked that only an approximate
value for the resolution is ever needed. Exercise 11.6
explores another possibility for improving the behavior of
RESOLUTION.
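
The 3-versus-4 behavior can be checked with a little arithmetic.
The sketch below is in Python rather than SNOBOL4, purely for
illustration. It assumes that the quoted 3.3 stands for exactly
10/3 milliseconds and that the clock reading is truncated to whole
milliseconds; both assumptions, and the name observed_steps, are
not from the text.

```python
from collections import Counter
from fractions import Fraction


def observed_steps(tick_ms, n_ticks):
    """Count the integer-millisecond jumps seen across n_ticks clock ticks."""
    # A truncating millisecond clock reports int(true time) at every tick.
    readings = [int(tick_ms * k) for k in range(n_ticks + 1)]
    return Counter(later - earlier
                   for earlier, later in zip(readings, readings[1:]))


# Assumption: the quoted 3.3 ms tick is exactly 10/3 ms.
steps = observed_steps(Fraction(10, 3), 300)
for size in sorted(steps):
    print(size, "ms step seen", steps[size], "times out of 300")
```

Under these assumptions the steps repeat in the pattern 3, 3, 4,
which matches the two-thirds and one-third split quoted above.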


Program 11.2  TIMER

The timer routine shown below will time a statement (or
statements) passed to it as an argument. Thus

          TIMER(' A = B + C ')

will determine how much time is required to execute the given
assignment statement and will print appropriate statistics.

If more than one statement is to be timed they should be
separated by semicolons.

To time a statement it is placed in a loop and executed for
several times longer than the resolution of the clock. In
order to deduct the time required to increment a counter and
test, the loop is executed twice, once with the statement in
and once with it out.


          DEFINE('TIMER(S_,N_)C_,T_,I_')             :(TIMER_END)
*
*   Entry point: On first call, fall through. When TIMER is
*   called recursively, N_ is nonzero and control passes to
*   TIMER_N.
*
TIMER     EQ(N_,0)                                   :F(TIMER_N)
*
*   Starting with 10 executions, double the number until the
*   difference between the times required to execute and not
*   execute the given statement is 20 ticks of the clock.
*
          N_ = 10
TIMER_1   T_ = TIMER(' ;' S_,N_) - TIMER(,N_)        :F(FRETURN)
          N_ = LT(T_,20 * RESOLUTION()) N_ * 2       :S(TIMER_1)
*
*   Now print the results.
*
          T_ = CONVERT(T_,'REAL')
          OUTPUT =
          OUTPUT = 'THE STATEMENT'
          OUTPUT = S_
          OUTPUT = 'REQUIRED ' (T_ / N_) ' MILLISECONDS +/- 10%'
+                  ' TO EXECUTE IN ' SYSTEM()        :(RETURN)
*
*   Here if N_ is nonzero. Prepare a string C_ which will be
*   compiled and executed and will contain the statement to be
*   measured together with a control loop.
*
TIMER_N   I_ = 1
          C_ = ' COLLECT() ; TIMER = TIME() ;'
+              'TIMER_3' S_ ';'
+              ' I_ = I_ + 1 LT(I_,' N_ ') :S(TIMER_3);'
+              ' TIMER = TIME() - TIMER :(RETURN) '
*
*   Compile the string and, if successful, execute it.
*
          C_ = CODE(C_)                              :S<C_>F(FRETURN)
TIMER_END
Names referenced    Name          Type        Where defined
by TIMER:           SYSTEM        Function    Program 11.3
                    RESOLUTION    Function    Program 11.1

Epilogue

Note that the temporaries and arguments are given 'funny'
names, i.e. ending with the underscore (_) character. This is
to avoid conflict with variables in the statement being timed.
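
The same measure-by-subtraction idea can be sketched outside
SNOBOL4. The Python below is only an illustration of the
technique, not a translation of TIMER: the names time_loop and
timer, the no-op baseline, and the fixed min_diff threshold
(standing in for 20 * RESOLUTION()) are all assumptions made for
the example.

```python
import time


def time_loop(fn, n):
    """Seconds taken to call fn() n times, loop overhead included."""
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return time.perf_counter() - start


def timer(fn, min_diff=0.02):
    """Estimate seconds per call of fn, net of loop and call overhead."""
    def noop():
        return None

    n = 10
    while True:
        # Run the loop twice: once with the work in, once with it out.
        diff = time_loop(fn, n) - time_loop(noop, n)
        if diff >= min_diff:
            return diff / n
        n *= 2  # not yet well above the clock resolution: double and retry


per_call = timer(lambda: sum(range(1000)))
print("about", per_call * 1e6, "microseconds per call")
```

As with TIMER, the doubling only terminates when the timed work
costs measurably more than the empty loop, so the sketch should
be given something that does real work.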


Program 11.3  SYSTEM

SYSTEM() is a function which will attempt to determine which
of the various SNOBOL4 processors it is running under. For
example, under SPITBOL, SYSTEM() will return 'SPITBOL'. The
function is not easy to write because if there is a difference
between any two processors this may be regarded as a
deficiency and may get fixed sometime in the future, rendering
the function we're about to write invalid.

One of the main differences between the various systems is in
the functions and/or keywords implemented. Unhappily, one
cannot test directly for the existence of such functions or
keywords, so knowing about such differences does us no good.

SYSTEM() was used to identify which implementation was being
measured by TIMER and is provided more for its intrinsic
interest than its necessity.


          DEFINE('SYSTEM()K')                        :(SYSTEM_END)
*
*   Entry point: First separate out MAINBOL from the other
*   processors. Only MAINBOL regards .X as a string.
*
SYSTEM    IDENT(DATATYPE(.X),'STRING')               :F(SYSTEM_2)
*
*   Falling through implies MAINBOL. Now separate out the
*   various systems on the basis of the SIZE of &ALPHABET. The
*   Honeywell 635 uses a 9-bit code. IBM equipment uses an
*   8-bit character while the PDP-10 uses 7-bit ASCII.
*
          K = SIZE(&ALPHABET)
          SYSTEM = EQ(K,512) 'HONEYWELL MAINBOL'     :S(RETURN)
          SYSTEM = EQ(K,256) 'IBM MAINBOL'           :S(SYSTEM_1)
          SYSTEM = EQ(K,128) 'PDP-10 MAINBOL'        :S(RETURN)
*
*   Both CDC and UNIVAC MAINBOL's use 6-bit codes. We can
*   distinguish between these two systems by the order of
*   characters in &ALPHABET. Only CDC contains () as adjacent
*   characters.
*
          SYSTEM = 'CDC MAINBOL'
          &ALPHABET '()'                             :S(SYSTEM_1)
          SYSTEM = 'UNIVAC MAINBOL'                  :(RETURN)
*
*   Here to test if the system also contains blocks. The
*   operator sharp (#) will have a lower precedence than blank
*   if the blocks extension is available. If the value of T is
*   1 (5 + 5) then we're in pure MAINBOL. Otherwise we've got
*   blocks.
*
SYSTEM_1  OPSYN('OLD_SHARP','#',2)
          OPSYN('#','+',2)
          T = 1 5 # 5
          OPSYN('#','OLD_SHARP',2)
          EQ(T,110)                                  :S(RETURN)
          SYSTEM = SYSTEM ' WITH BLOCKS'             :(RETURN)
*
*   Here if not MAINBOL. FASBOL has an unorthodox SUBSTR
*   function.
*
SYSTEM_2
          SYSTEM = DIFFER(SUBSTR('ABC',2,1),'B') 'FASBOL'
+                                                    :S(RETURN)
*
*   SITBOL, running on the PDP-10, can easily be distinguished
*   from the IBM SPITBOL by the size of &ALPHABET.
*
          SYSTEM = EQ(SIZE(&ALPHABET),128) 'SITBOL'  :S(RETURN)
          SYSTEM = 'SPITBOL'                         :(RETURN)
SYSTEM_END


Epilogue


The above function is obviously incomplete as it does not 
include all machines for which MAINBOL has been expanded. If 
your favorite processor is not among the group you are 
encouraged to modify the program to include it. 


Anatomy of a SNOBOL4 Statement

In this section we study the time requirements of SNOBOL4
statements. Such an analysis may at first blush seem rather
difficult because in a language as rich as SNOBOL4 there is
'so much going on'. But just the reverse is the case. For
example, Table 11.3 shows the times required to execute in
SPITBOL and MAINBOL a sequence of four statements in ascending
order of complexity. TIMER, Program 11.2, was used to time
these statements and is responsible for other similar timing
figures given in this section. All times in this section were
made on (or normalized to) an IBM 360 Mod 65. For possible
comparison with other processors, some representative
instruction times are given in Table 11.4.


In Table 11.3, we see that null statements (statements
which do nothing) consume relatively little time; i.e.
statement overhead is relatively small. Assignment is fairly
fast since, for all datatypes, it is merely a descriptor (two
32-bit words) copy. But the most notable thing about Table
11.3 is that there is a linear relationship of time with the
number of arithmetic operators.


This relationship is more nearly linear in an interpreter or
type-3 system because the various operations are 'packaged'
more so than in a type-4 compiler. In a type-4 system, code
optimization techniques render more interaction between
operations of the same expression so that the time of a
statement is not simply the sum of the times of the component
operations.

Table 11.3 Time in milliseconds required to execute a
sequence of arithmetic assignment statements.

| Statement             SPITBOL    MAINBOL   |
| ------------------------------------------ |
| A                     .0012      .02       |
| A = I                 .004       .10       |
| A = I + J             .009       .30       |
| A = I + J + K         .015       .50       |
| A = I + J + K + L     .021       .70       |
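
The linearity claimed for Table 11.3 can be checked by taking
successive differences. The Python sketch below is illustrative
only; it assumes the last four rows of the table are the
assignments with zero, one, two and three additions, and the
helper name increments is made up for the example.

```python
# Times in milliseconds from Table 11.3; index = number of additions
# on the right-hand side (index 0 is the plain assignment).
spitbol = [0.004, 0.009, 0.015, 0.021]
mainbol = [0.10, 0.30, 0.50, 0.70]


def increments(times):
    """Extra time charged for each additional operator."""
    return [round(later - earlier, 3)
            for earlier, later in zip(times, times[1:])]


print("SPITBOL cost per added operator:", increments(spitbol))
print("MAINBOL cost per added operator:", increments(mainbol))
```

A constant increment per operator is what a linear relationship
means here; the MAINBOL increment agrees with the .2 millisecond
figure quoted for its arithmetic operations.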


Measuring the time of an operation which does not generate 
storage is fairly straightforward as the direct measurement by 
TIMER may be used. If the operation generates storage which 
must later be collected, an additional increment of time 
should be charged to such an operation. We will see later how 
this can be done. 


Arithmetic. Table 11.5 shows the time required for arithmetic
operations. In MAINBOL the time is dominated by overhead so
that all operations, even exponentiation, take pretty much the
same time (about .2 milliseconds). This even includes the case
where one of the operands must be converted to string or real.

Table 11.4 Selected instruction times for the IBM 360/65.
(N is the number of characters involved in a
multiple-character operation.)

| Operation                             Time           |
|                                       (microseconds) |
| ---------------------------------------------------- |
| Load (1 word)                         .95            |
| Store (1 word)                        .93            |
| Add (storage-to-register)             1.65           |
| Floating add (storage-to-register)    1.68           |
| Multiply (storage-to-reg.)            4.45           |
| Divide (storage into reg.)            9.00           |
| Compare (reg. with storage)           1.40           |
| Branch                                1.10           |
| MVC (storage-to-storage move)         3 + .3N        |
| CLC (storage-to-storage compare)      2.9 + .3N      |
| TRT (SPAN & BREAK)                    4.1 + 1.2N     |
| TR  (REPLACE)                         1.9 + 1.8N     |
In SPITBOL, as may be expected, the overhead has been reduced
to the point where variations in the natural execution times
do show up in the time for the overall operations. Thus,
integer division (.019) takes longer than integer
multiplication (.014), which in turn takes longer than addition
(.007); this reflects differences in the absolute times to
perform these instructions (.009, .005, and .001 respectively).


Table 11.5 Time in milliseconds to carry out selected
arithmetic operations in SPITBOL and MAINBOL on the IBM
360/65.

| Data Type  Operation  Data Type    SPITBOL   MAINBOL |
| ---------------------------------------------------- |
| integer    +          integer      .007      .2      |
| integer    -          integer      .007      .2      |
| integer    *          integer      .014      .2      |
| integer    /          integer      .019      .2      |
| integer    **         integer      .039      .2      |
| integer    REMDR      integer      .035      .18     |
| integer    +          real         .061      .2      |
| real       +          integer      .067      .2      |
| real       *          real         .016      .14     |
| integer    +          string (2)   .084      .22     |

Table 11.5 shows a ratio of improvement of SPITBOL over 
MAINBOL which varies from about 25:1 in the case of integer 
arithmetic to about 2.5:1 in the case of addition with one ar- 
gument a string. This is because, in the latter case, the time 
is dominated by the conversion, and this MAINBOL does within a 
single macro, so that the SPITBOL approach grants no 
advantage. 


Flow of Control. Various operations associated with flow of
control are given in Table 11.6. These figures should be
sufficient to predict the time of simple looping control
instructions.

Table 11.6 shows time in milliseconds of flow-of-control type
operations for SPITBOL and MAINBOL.

| Operation                                SPITBOL     MAINBOL   |
| -------------------------------------------------------------- |
| GT,LT,EQ,LE,GE,NE                        .02         .2        |
| IDENT, DIFFER                            .02         .2        |
| LGT                                      .05         .35       |
| Null Concatenation                       .02         .2        |
| Label Goto                               .027        .17       |
| Code Goto                                .037        .20       |
| Function call (N = # of args and temps)  .09+.012N   .40+.03N  |

For example, the standard method of implementing a loop in
SNOBOL4 is some variant of

          N = 0
LOOP      N = N + 1 LT(N,100)                 :F(LOOP_OUT)
                                              :(LOOP)
LOOP_OUT

which will execute the inner part of the loop 100 times. The
statement labeled LOOP will be executed 100 times before
failing. Predicates such as LT() will return the null string
when they succeed as this is the least flagrant value they can
return. Concatenation treats null as a special case, simply
returning the other value, and hence is very fast.


The time to execute the statement labeled LOOP can be obtained 
by adding the times for assignment, addition, LT() and null 
concatenation which yields .70 for MAINBOL and .051 for 
SPITBOL. To this should be added the time to execute a label 
goto which brings the total control overhead to .87 and .078 
milliseconds respectively. 
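
The arithmetic behind these totals is a plain sum of table
entries. The Python sketch below simply redoes it; the
dictionary layout and variable names are assumptions for the
example, and the MAINBOL label-goto time is inferred from the
.87 total quoted above rather than read from a table.

```python
# (SPITBOL, MAINBOL) times in milliseconds, as quoted in this section.
COMPONENTS = {
    "assignment":         (0.004, 0.10),
    "integer addition":   (0.007, 0.2),
    "LT()":               (0.02, 0.2),
    "null concatenation": (0.02, 0.2),
}
SPITBOL_LABEL_GOTO = 0.027

# Time for the statement labeled LOOP, without the goto.
spitbol_stmt = round(sum(s for s, _ in COMPONENTS.values()), 3)
mainbol_stmt = round(sum(m for _, m in COMPONENTS.values()), 3)

# Adding the label goto gives the total control overhead per iteration.
spitbol_total = round(spitbol_stmt + SPITBOL_LABEL_GOTO, 3)
# The text gives .87 ms for MAINBOL including the goto, which implies:
mainbol_goto = round(0.87 - mainbol_stmt, 3)

print("statement:", spitbol_stmt, mainbol_stmt)
print("SPITBOL with goto:", spitbol_total)
print("implied MAINBOL label goto:", mainbol_goto)
```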


The time to execute a goto is influenced slightly by whether
it is a failure goto or a success goto, and by the actual
configuration of the goto portion of the statement. The figure
given in Table 11.6 is simply an estimate, usable mainly because
the transfer of control normally consumes a very small portion
of the total time. The total time required by a function is
found by adding the function overhead time, given in Table
11.6, to the time required to execute the function's
statements. The time of a RETURN (or FRETURN) is absorbed in
the function overhead.


Miscellany. Table 11.7 contains a miscellaneous collection of
times for a number of different operations. Some of the
operations generate storage which will lengthen subsequent
garbage collections but the times given do not reflect this
cost (see the Epilogue of TIMEGC, Prog. 11.4). It is
interesting to note that with the indirect reference (unary $)
the times required by SPITBOL and MAINBOL are almost the same.
Because MAINBOL hashes all data strings it does not have to
hash for indirect reference. SPITBOL does, but the hashing does
not take as long as MAINBOL's interpretive loop.

Table 11.7 shows timings of miscellaneous operations. N, where
indicated, is the number of characters involved in the
operation. Times do not include garbage collection overhead.

| Operation                  SPITBOL       MAINBOL    |
| --------------------------------------------------- |
| Concatenation              .05+.0005N    .35+.0005N |
| SIZE                       .023          .13        |
| DUPL (of a single char)    .045+.0003N   .6+.027N   |
| $ (indirect reference)     .09           .12        |
| PROTOTYPE                  .016          .13        |
| A<I>                       .03           .30        |
| A<I,J>                     .07           .45        |
| ARRAY(N)                   .06+.03N      .7+.03N    |
| CODE(' X = Y + Z :(LA)')   1.53          3.7        |
| EVAL('LGT(S1,S2)')         1.2           3.1        |

Pattern Matching. The execution of a pattern matching
statement consists of five distinct parts: subject evaluation,
pattern evaluation (pattern building), pattern matching proper
(scanning), object evaluation and replacement. Not all of
these operations need be present. The time to execute such a
statement is the sum of the times of its component parts. The
subject and object evaluation are in the same category as
ordinary expression evaluation. The replacement operation is
approximately equivalent in time to two concatenations and is
given in Table 11.10.


The time required to build a pattern is, to a first
approximation, proportional to its size. Table 11.8 contains
some representative times for the construction of patterns.
Variables A, B and AB are used rather than constants 'A', 'B'
and 'AB' because SPITBOL precomputes any constant-valued
expression such as 'A' | 'B'. As indicated in the table, the
time is measured in the absence of garbage collection. As we
will see, garbage collection will approximately double this
figure.


Table 11.8 indicates timings (in milliseconds) of selected
pattern-building operations. Times do not include that
attributable to garbage collection.

|                                                  No. of     |
| Pattern expression        SPITBOL   MAINBOL      Primitives |
| ----------------------------------------------------------- |
| A | B                     .167      .80          2          |
| (A | B) . X               .466      [illegible]  4          |
| (A | B) . X (A | B) . Y   1.16      2.7          8          |
| BREAK(A)                  .07       .36          1          |
| BREAK(AB)                 .12       .36          1          |
| BREAK(AB) . X             .41       .93          4          |
| BREAK(AB) . X LEN(1)      .57       1.78         5          |
| ----------------------------------------------------------- |
| where: A = 'A', B = 'B', AB = 'AB'                          |


To a first approximation the time required for pattern
matching proper (scanning) is some fixed overhead given by
Table 11.10 plus the total attributable to individual primitive
matches (and failures) as given by Table 11.9. Thus the
pattern match below

          S = DUPL('A',100)

S ("At | BS) Nel 

will have approximately 3N primitive matches (N = 100 being
the length of S): N successful matches by 'A', and N failures
each by 'B' and 'C'. Table 11.9 indicates that in SPITBOL it
requires .04 milliseconds per string primitive, resulting in a
total time of 12 milliseconds plus overhead.
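
As a worked check of this estimate, the short Python sketch
below multiplies it out. The constant names are invented for the
example, and the .09 millisecond overhead is the SPITBOL
'Matching Overhead' entry of Table 11.10, applied once per match
statement as an assumption.

```python
N = 100                      # length of the subject string S
PRIMITIVES_PER_CHAR = 3      # 'A' succeeds, 'B' and 'C' each fail
STRING_PRIMITIVE_MS = 0.040  # SPITBOL, Table 11.9
MATCH_OVERHEAD_MS = 0.09     # SPITBOL, Table 11.10 (assumed once per match)

primitive_matches = PRIMITIVES_PER_CHAR * N
scan_ms = round(primitive_matches * STRING_PRIMITIVE_MS, 3)
total_ms = round(scan_ms + MATCH_OVERHEAD_MS, 3)

print(primitive_matches, "primitive matches")
print("scanning:", scan_ms, "ms; with overhead:", total_ms, "ms")
```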


Table 11.9 Times (in milliseconds) per primitive match and
per character for selected primitives. N indicates the number
of characters matched for multi-character operations.

| Primitive     SPITBOL        MAINBOL    |
| --------------------------------------- |
| String        .040           .18        |
| RPOS(N)       .020           .20        |
| LEN(N)        .020           .20        |
| POS(N)        .020           .20        |
| NOTANY(S)     .028           .24        |
| NOTANY(*S)    .071           .42        |
| SPAN(S)       .040+.0014N    .25+.0014N |
| BREAK(S)      .040+.0014N    .25+.0014N |


Table 11.10 Other miscellaneous timings associated with
pattern matching. Times are in milliseconds and are
approximate.

| Operation                   SPITBOL        MAINBOL     |
| ------------------------------------------------------ |
| Matching Overhead           .09            [illegible] |
| Replacement                 .082+.0005N    .42+.0005N  |
| Pure String Scanning Rate   .0014          .04         |
|   (per character)                                      |
| ARBNO, per iteration        .010           .26         |
| GBAL                        .043+.017N     .22+.033N   |

The reader is cautioned that this analysis is approximate. The
time required to scan (P1 | P2) will be less than the sum of
the separate scanning times. Also, the time for a failure will
be slightly different from that for a success. If differences
on the order of 20% or so are significant, the reader is urged
to make his own timing tests of time-critical statements.


The reader should also note that pattern matching heuristics
play a significant role in affecting the overall time. Thus
the pattern

          POS(143) 'CAT'

will result in two primitive matches in SITBOL and SPITBOL
because of the POS heuristic (see Chapter 7) but will require
145 primitive matches in MAINBOL (assuming the subject is long
enough). Also, the futility heuristic can greatly reduce the
number of primitives matched.


When the pattern is a simple string, SPITBOL and MAINBOL treat
it as a special case resulting in a faster scan as indicated
in Table 11.10. If ARBNO appears in a pattern, then to the
time required for all primitive matchings must be added the
sum of all ARBNO extents multiplied by the weighting factor
given in Table 11.10. BAL, as indicated in Chapter 7, is
implemented by the repeated use of a primitive GBAL which
matches the shortest nontrivial balanced string. Thus BAL will
match the string '(XXXX)' with one application of the
primitive GBAL and will match 'XXXXXX' with 6 applications of
GBAL. Hence it requires much less time to match the former
than it does the latter. For example, in MAINBOL, it requires
.22 + (.033)(6) msec. to match '(XXXX)' whereas it requires
(.22)(6) msec. to match 'XXXXXX'.
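
The two BAL estimates can be evaluated directly. The Python
sketch below uses the MAINBOL GBAL entry of Table 11.10; the
function name gbal_ms is invented, and the third figure, which
also charges the per-character term on each of the six short
matches, is an assumption added for comparison and is not a
number from the text.

```python
def gbal_ms(n_chars, base=0.22, per_char=0.033):
    """MAINBOL time for one GBAL application matching n_chars characters."""
    return base + per_char * n_chars


# '(XXXX)': one application of GBAL over all 6 characters.
parenthesized_ms = round(gbal_ms(6), 3)

# 'XXXXXX': six applications; the text counts only the fixed part.
flat_ms_as_in_text = round(6 * 0.22, 3)

# Assumption: charging the per-character term too, one character each.
flat_ms_with_chars = round(6 * gbal_ms(1), 3)

print("'(XXXX)':", parenthesized_ms, "ms")
print("'XXXXXX':", flat_ms_as_in_text, "ms (text),",
      flat_ms_with_chars, "ms (with per-character term)")
```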


I/O Timing. When INPUT is mentioned in the source program, a
line is read. How long does it take? This has no easy answer.
Clearly different devices require different times. Even if we
restrict our attention to one device, such as the disk, the
issue is compounded by a host of factors. As a rough rule of
thumb the total time required to move the arm of a disk drive
into position (seek time) and wait for the information to come
under the read heads (latency) plus the amount of time to
actually read is, to grossly simplify, on the order of 100
milliseconds. This figure is not normally charged directly to
the user since the operating system can direct the CPU to do
other things during the interim. This represents an
extraordinarily complex situation not made less so by a
variety of charging algorithms and scheduling philosophies. A
rule of thumb is that the effective cost is equivalent to half
the elapsed time. Hence, for disk, one may assume 50
milliseconds per transmission. Since the time of transmission
is relatively independent of the amount transmitted it pays to
transmit more than one line at a time. Hence, lines are
transmitted in what is called a block. The number of lines per
block is called the blocking factor. A typical blocking factor
for efficient disk I/O is on the order of 100, which converts
the effective transmission time to .5 milliseconds per line.

To this we must add the processing time to extract a given
line from a buffer. This again will require rule of thumb
estimates. In MAINBOL a rather slow Fortran conversion routine
causes an I/O operation to require 5 milliseconds per line
(IBM 360 Mod 65). Hence if the file is properly blocked, I/O
times are dominated by this figure. In SPITBOL, Fortran I/O
is sidestepped and the required processing takes about half a
millisecond. Hence, in SPITBOL, an I/O reference requires a
total of approximately one millisecond.
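
The rules of thumb above chain together as simple arithmetic.
The following Python sketch retraces them; all names are
invented for the example, and the MAINBOL total is an
extrapolation that adds the same .5 millisecond transmission
share to the 5 millisecond processing figure.

```python
ELAPSED_MS_PER_TRANSMISSION = 100.0  # seek + latency + read, order of magnitude
CHARGED_FRACTION = 0.5               # effective cost is half the elapsed time
BLOCKING_FACTOR = 100                # lines per block

effective_ms = ELAPSED_MS_PER_TRANSMISSION * CHARGED_FRACTION
transmission_ms_per_line = effective_ms / BLOCKING_FACTOR

# Processing time to extract one line from the buffer.
SPITBOL_PROCESSING_MS = 0.5
MAINBOL_PROCESSING_MS = 5.0

spitbol_ms_per_line = transmission_ms_per_line + SPITBOL_PROCESSING_MS
mainbol_ms_per_line = transmission_ms_per_line + MAINBOL_PROCESSING_MS

print("transmission per line:", transmission_ms_per_line, "ms")
print("SPITBOL per line:", spitbol_ms_per_line, "ms")
print("MAINBOL per line:", mainbol_ms_per_line, "ms")
```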


Program 11.4  TIMEGC

The following program will permit the caller to time a
'typical' garbage collect. Strings, array elements and
programmer-defined datatypes are strewn about in rather
chaotic fashion and a call is made to clean some of it up. An
argument to TIMEGC can be given which will alter the amount
and somewhat the type of litter. The caller may experiment
with other values of this number as well as with different
kinds of allocation to see if the garbage collect time
varies significantly.


          DEFINE('TIMEGC(N)I,S,A,L,T,K,FREED')
          DATA('LINK(VALUE,NEXT)')                   :(TIMEGC_END)
*
*   Entry point and top of loop. Free everything and issue a
*   garbage collect.
*
TIMEGC    I = ; S = ; A = ; L =
          COLLECT()
          N = IDENT(N) 25
          A = ARRAY(N)
*
*   Allocation loop: For each I from 1 through N allocate
*   approximately one length-80 string, assign a length I string
*   to A<I> and add one element to the linked-list L.
*
TIMEGC_1  I = I + 1
          $I = DUPL(' ',78) I
          A<I> = DUPL('*',I)
          L = LINK(NULL,L)
          GE(I,N)                                    :F(TIMEGC_1)
*
*   Determine the storage remaining. Then loosen about half of
*   it and issue a garbage collect. Determine how much was
*   collected and how long it took to make the collection.
*
          STREM = COLLECT()
TIMEGC_2
          $I = ; A<I> = ; L = NEXT(L)
          I = I - 2 GT(I,2)                          :S(TIMEGC_2)
          T = TIME()
          FREED = FREED + (COLLECT() - STREM)
          TIMEGC = TIMEGC + (TIME() - T)
          K = K + 1
*
*   If not significantly more than the resolution of the
*   clock, go back for more. Otherwise produce some
*   statistics.
*
          LT(TIMEGC,50 * RESOLUTION())               :S(TIMEGC)
          OUTPUT =
          OUTPUT = 'IN ' SYSTEM() ' ' K ' GARBAGE COLLECTS'
+           ' REQUIRED A TOTAL OF ' TIMEGC ' MILLISECONDS TO FREE '
+           FREED ' STORAGE UNITS.'
          TIMEGC = CONVERT(TIMEGC,'REAL')
          OUTPUT = 'THIS AVERAGES TO ' (TIMEGC / K) ' MSEC. PER'
+           ' GARBAGE COLLECT AND ' (TIMEGC / FREED) ' MSEC. PER'
+           ' STORAGE UNIT.'                         :(RETURN)
TIMEGC_END
Names referenced    Name          Type        Where defined
by TIMEGC:          RESOLUTION    Function    Program 11.1

Epilogue

TIMEGC(N) was called for various values of N and the results
are given in Table 11.11.


Table 11.11 Data obtained by calling TIMEGC with a variety of
arguments.

|     |            SPITBOL             |            MAINBOL             |
|     | Ave GC    Storage   Time       | Ave GC    Storage   Time       |
|  N  | Time      Coll.     per byte   | Time      Coll.     per byte   |
|     | (msec)    per GC    (microsec) | (msec)    per GC    (microsec) |
| --- | ------------------------------ | ------------------------------ |
|  50 |  17        3.4K     5.0        |  98        5.8K     17.0       |
| 100 |  27        8.1K     3.3        | 105       13.5K     8.9        |
| 150 |  41       14.0K     2.9        | 144       21.6K     6.7        |
| 200 |  51       21.3K     2.4        | 196       31.5K     6.3        |
| 250 |  77       30.0K     2.6        | 220       42.6K     5.2        |
| 300 | 104       39.4K     2.6        | 224       55.0K     4.1        |
| 350 | 138       50.0K     2.8        | 256       68.4K     3.9        |
| 400 | 183       62.4K     2.9        | 304       83.3K     3.5        |
| 450 | 210       76.0K     2.8        | 343      100  K     3.5        |
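
The per-byte column of Table 11.11 follows from the other two
columns. The Python sketch below recomputes it for the SPITBOL
half of the table; it assumes K means 1000 bytes, so that
milliseconds per K bytes equals microseconds per byte, and the
variable names are made up for the example.

```python
# SPITBOL columns of Table 11.11: (N, average GC time in msec,
# storage collected per GC in K bytes).
spitbol_rows = [
    (50, 17, 3.4), (100, 27, 8.1), (150, 41, 14.0),
    (200, 51, 21.3), (250, 77, 30.0), (300, 104, 39.4),
    (350, 138, 50.0), (400, 183, 62.4), (450, 210, 76.0),
]

# Assumption: K = 1000 bytes, so msec / K bytes = microseconds per byte.
microsec_per_byte = [round(msec / kbytes, 1)
                     for _, msec, kbytes in spitbol_rows]

for (n, _, _), rate in zip(spitbol_rows, microsec_per_byte):
    print("N =", n, "->", rate, "microseconds per byte")
```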


As might be expected, the time to garbage collect is a
function of how many allocated objects are lying about in core.
For small collections, SPITBOL has a clear advantage over
MAINBOL; but this advantage curiously diminishes as the
collections become larger. (This anomaly has yet to be
explained.) Also, as collections get larger, the time required
per byte collected seems to converge to about three
microseconds. This figure is not absolute since garbage
collections in which very little storage as a fraction of the
whole is retrieved can require much more than this.
Nevertheless, it serves as a useful rule of thumb for
estimating the garbage collection overhead attributable to an
operation that allocates storage. For example, Table 11.7
indicates the time for concatenation to be .05+.0005N
milliseconds in SPITBOL. To this we must add a factor
attributable to later garbage collection. In SPITBOL, a string
requires 6 + N bytes of storage as indicated in Table 11.12.
Using a figure of 3 microseconds per byte, the real cost of
concatenation is .068 + .0035N milliseconds.
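
The step from .05+.0005N to .068+.0035N can be retraced
numerically. The Python sketch below is only a restatement of
that arithmetic; the function names are invented for the
example.

```python
GC_MS_PER_BYTE = 0.003  # rule of thumb: 3 microseconds per byte collected


def concat_direct_ms(n):
    """SPITBOL concatenation time from Table 11.7, N result characters."""
    return 0.05 + 0.0005 * n


def string_bytes(n):
    """SPITBOL storage for an N-character string, from Table 11.12."""
    return 6 + n


def concat_real_ms(n):
    """Direct time plus the later garbage collection share."""
    return concat_direct_ms(n) + GC_MS_PER_BYTE * string_bytes(n)


# Constant term and per-character slope of the combined cost.
constant = round(concat_real_ms(0), 4)
slope = round(concat_real_ms(1) - concat_real_ms(0), 4)
print("real cost =", constant, "+", slope, "N milliseconds")
```

The garbage collection share contributes .018 to the constant
and .003 per character, which dwarfs the direct per-character
cost of .0005.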


Table 11.12 shows the amount of storage required for a
variety of datatypes. Storage is given in bytes.

| Datatype                            | SPITBOL       | MAINBOL    |
|-------------------------------------|---------------|------------|
| String (N is no. of chars.)         | 6 + N         | 32 + N     |
|                                     |               |            |
| Variable (N is number               |               |            |
|   of characters in name)            | 38 + N        | 32 + N     |
|                                     |               |            |
| Patterns (N is no. of primitives,   |               |            |
|   A is no. of ANY, NOTANY's,        |               |            |
|   B is no. of BREAK & SPAN's*,      | 16 + 16N +    |            |
|   figure is approximate)            | 32A + 256B    | 8 + 32N    |
|                                     |               |            |
| Arrays (N is no. of elements and    |               |            |
|   D is no. of dimensions)           | 20 + 8N + 8D  | 16+8N+16D  |
|                                     |               |            |
| Prog. Defined Data Object           |               |            |
|   (N is no. of fields)              | 8 + 8N        | 8 + 8N     |
|                                     |               |            |
| Table (E is no. of items in         |               |            |
|   the table and I is the initial    |               |            |
|   first argument to the TABLE       |               |            |
|   function)                         | 12 + 24E + 4I | 8 + 16E    |

* If the argument to BREAK or SPAN is only one character,
  no additional storage is required (B is 0).

The Inner Loop

It is characteristic of many programs that approximately 90%
of the time is spent in 10% of the program. This is true of
SNOBOL4 itself and it tends to be true of programs written in
the language. Whether or not the topology of the program
merits the epithet, the point or points within the program
where most of the time is spent is called the 'inner loop'.
While the SITBOL system has an automatic method for
determining which statements are responsible for the most
time, most SNOBOL4 systems do not. There do exist, however,
certain tracing tools which may be used to examine a program's
behaviour and extract at least approximate timing information.


Program 11.5  LPROG

LPROG() will return the length (i.e. the number of
statements) of the SNOBOL4 program in which it is called.
LPROG will actually cause one more statement to be compiled at
run-time so that its repeated use will return slightly
different values. If new code is compiled in the interim, the
value returned by LPROG will be augmented by the number of new
statements.


          DEFINE('LPROG()')                      :(LPROG_END)
*
*  Entry point: Compile a statement and return 1 less than
*  its statement number.
*
LPROG     :<CODE(' LPROG = &STNO - 1 :(RETURN)')>
LPROG_END


Epilogue 


LPROG has intrinsic interest of its own as well as being a 
useful, if not essential, tool in constructing an array to 
record a program's profile (as we shall see). 


Program 11.6  FPROFILE

FPROFILE is a program which determines the number of times
each statement is executed in the program in which it is
embedded. This is called the frequency profile of the program.
The statistics gathering begins when the initialization
section of FPROFILE is executed and tracing is turned on.
Hence FPROFILE is normally placed before the program to be
monitored but must be placed after the LPROG function which it
calls during initialization. For each statement executed after
tracing has been established, FPROFILE is called and a
tabulation is made in an array (FP_ARY). At any given time
during the course of execution, statement number N will have
been executed FP_ARY<N> times.


          DEFINE('FPROFILE()')
*
*  Allocate an array to gather statistics and set up tracing
*  on the keyword &STCOUNT.
*
          FP_ARY = ARRAY(LPROG())
          TRACE(.STCOUNT,'KEYWORD',,'FPROFILE')
          &TRACE = 1000000                       :(FPROFILE_END)
*
*  Entry point of FPROFILE (called at each executable
*  statement).
*
FPROFILE  FP_ARY<&LASTNO> = FP_ARY<&LASTNO> + 1  :(RETURN)
FPROFILE_END


Names referenced    Name       Type        Where defined
by FPROFILE:        LPROG *    Function    Program 11.5


* indicates name is referenced in the initialization section. 
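
The idea of tabulating a count at every executed statement can be illustrated outside SNOBOL4. The Python fragment below is an analogy only, not a translation of FPROFILE: it assumes CPython's sys.settrace hook as a stand-in for tracing &STCOUNT, and the names frequency_profile and demo are hypothetical.

```python
import sys
from collections import Counter


def frequency_profile(func, *args):
    """Count how often each line of func runs (offsets from its def line)."""
    counts = Counter()
    code = func.__code__

    def tracer(frame, event, arg):
        # Only follow frames that execute the function being profiled.
        if frame.f_code is code:
            if event == "line":
                counts[frame.f_lineno - code.co_firstlineno] += 1
            return tracer
        return None

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)   # restore whatever hook was active before
    return counts


def demo(n):
    total = 0
    for k in range(n):
        total += k
    return total


if __name__ == "__main__":
    print(sorted(frequency_profile(demo, 5).items()))
```

As with FPROFILE, the tabulation itself costs time on every statement, so a hook of this kind is better suited to counting than to timing.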


Program 11.7  TPROFILE

A time profile of a program indicates the relative time spent
in each statement. In a language like SNOBOL4, where there is
a relatively high variation in the time required to execute
any given statement, a time profile is much more desirable
than a frequency profile.

TPROFILE, a modification of FPROFILE, allocates to the
statement just executed the difference between the current
time and the last previous time. Unhappily, the time required
to gather the statistic may be as large as or even larger than
the time being measured. However, it is likely to be a more
valuable indicator than FPROFILE and in many cases can give a
surprisingly accurate time profile.


          DEFINE('TPROFILE()S,T')
*
*  Set up tracing. Times are tabulated in TP_ARY. TPROFILE
*  will be called at the start of each statement to be
*  executed.
*
          TP_ARY = ARRAY(LPROG())
          TRACE(.STCOUNT,'KEYWORD',,'TPROFILE')
          &TRACE = 1000000                       :(TPROFILE_END)
*
*  Entry point: Save the statement number (S) of the
*  statement about to be executed and quickly obtain the
*  time (T). Augment TP_ARY according to the last interrupted
*  statement.
*
TPROFILE  S = &LASTNO
          T = TIME()
          TP_ARY<LAST_STNO> = TP_ARY<LAST_STNO> + T - LAST_TIME
          LAST_STNO = S
          LAST_TIME = TIME()                     :(RETURN)
TPROFILE_END

Names referenced    Name       Type        Where defined
by TPROFILE:        LPROG *    Function    Program 11.5


* indicates name is referenced in the initialization section. 


Epilogue


To test the two profiling programs, the function BNORM (Prog.
10.1) was used. It was passed a string of approximately 120
characters containing 10 BSPACEs and two USCOREs. To average
out noise effects, BNORM was called 250 times. The results of
applying FPROFILE and TPROFILE to the program are shown in
Figure 11.3.


The data was collected on the SITBOL system so that a
comparison could be made with a 'true' time profile as
provided by a built-in facility. Figure 11.4 shows the results
of turning on the built-in profiler. As might be expected,
the times are a little higher for TPROFILE than they are truly
since each statement executed is accredited with a little of
the overhead used to gather the statistic. But the results
are surprisingly close due to the relatively small amount of
time required to execute a simple assignment statement.


For running TPROFILE on SPITBOL it is imperative to obtain the 
TIME() before &LASTNO because the latter represents a rela- 
tively slow operation. Exercise 11.11 provides a method of 
doing this. 


Exercise 11.1.  Which of the following linguistic facilities
require a run-time symbol table?

(a) Pattern matching

(b) A sort facility

(c) Run-time compilation

(d) Redefinition of functions

(e) Go to a label whose name is computed

(f) Call a function whose name is computed

(g) Linked-list operations

Figure 11.3  The result of applying FPROFILE (above) and
TPROFILE (below) to 250 calls to the BNORM function. The
numbers below the bars refer to statement numbers in BNORM.
Times are in seconds.

Figure 11.4  The histogram above shows the 'true' time profile
of the program run to produce the histograms in Figure 11.3.
Times are given in seconds.


Exercise 11.2.  Each method below for computing hash numbers
has at least one flaw. Indicate whether it is too
time-consuming (T), does not provide a good spread (S) or is
not repeatable (R). More than one letter might be applicable.
Assume each character is an 8-bit code which represents some
integer between 0 and 255. L is the length of the Hash Array.


(a) Multiply all the characters together ignoring overflows. 
Then divide by L and use the remainder. 


(b) Divide the size of the string by L and use the remainder. 


(c) Let L be 256 and choose simply the first character as the 
hash number. 


(d) Let L be 256 and Exclusive-OR all the characters
together.


(e) Add the size of the string to the last previous hash num- 
ber and divide by L, using the remainder. 


(f) Use the machine address of the first character of the 
string. 


Exercise 11.3.  As indicated in the text, compilers can be
ranked from Type 0 to Type 4. Each increase in compilation
complexity brings about a decrease in run-time flexibility.
What type of compiler is required to implement each of the
following language features in a reasonably straightforward
way? For example, if your answer is Type 2, then all compilers
of Type 2 and lower should have no special difficulty
implementing the feature. By Type 3 assume that the decision
to push a value or a pointer to a variable is made at compile
time.

(a) Run-time modification of operator precedence 

(b) A Sort function. 

(c) Redefinition of SNOBOL4 functions

(d) Redefinition of SNOBOL4 operators


(e) Run-time modification of the meanings of characters 
(E.g., hereinafter R is an operator). 


(f) Declarationless variables 

(g) Recursive functions 

(h) Run-time trace requests on variables 

(i) Run-time macros (hereafter all strings in the text of the
program of the form X shall be regarded as string Y).


Exercise 11.4.  Which of the following facilities are more
likely to be associated with a floating form of storage
management and which with fixed storage?


(a) Declaring a variable to be string and giving it a maximum 
length. 


(b) Arrays containing arbitrary and mixed datatypes. 

(c) Garbage Collection. 

(d) Functions which return arrays. 

(e) String assignment implemented via copying. 


Exercise 11.5.  Give an example of a statement which if timed
using TIMER would result in an infinite loop.


Exercise 11.6.  Modify RESOLUTION (Prog. 11.1) so that it
averages ten attempts to obtain the resolution. Make sure the
computation is done once and not at each call.

Exercise 11.7.  One can define the factorial of n (normally
written n!) as follows:


          DEFINE('F(N)')                         :(F_END)
F         F = LE(N,1) 1                          :S(RETURN)
          F = N * F(N - 1)                       :(RETURN)
F_END


Estimate the time required (in SPITBOL) to compute F(1), F(2)
and F(n) for arbitrary n. Compare the time required for this
recursive program with the following iterative version of the
factorial function.


          DEFINE('F(N)')                         :(F_END)
F         F = 1
F_1       F = GT(N,1) F * N                      :F(RETURN)
          N = N - 1                              :(F_1)
F_END

Exercise 11.8.  You are writing a pre-processor in SNOBOL4
which will examine each line of a source statement for the
occurrence of a special character (say 4%). If the special
character is there, the program will do something interesting.
Otherwise it copies the line intact. Write an 'inner loop'
that does nothing but read and write and check for the
existence of the special character. Assuming the lines
containing the special character are relatively rare, the
speed of processing approximates the speed of the inner loop.
Compute the speed of your pre-processor in statements per
minute operating in SPITBOL. Assume I/O time is one
millisecond per line.


Exercise 11.9.  Since error and trace messages are given in
terms of SNOBOL4 statement numbers it is helpful to have a
method of producing such numbers for statements compiled via
the CODE function. Redefine the CODE function in an upward
compatible way so that in addition to compiling code it sets
the global variable CODENO to the number of the statement (or
first statement of a sequence) being compiled. (Hint: Look at
the LPROG function and use the fact that SNOBOL4 assigns
statement numbers sequentially without breaks. Only two
statements are required in the body of the function.)


Exercise 11.10.  Modify LPROG (Prog. 11.5) so that it will
always return the value it returned when it was first called.
(Hint: This can be done by the insertion of 5 characters.)

Exercise 11.11.  TPROFILE (Prog. 11.7) attempts to obtain the
TIME() as quickly as possible but is torn by the fact that the
first statement executed must capture &LASTNO. Suggest how
TPROFILE can be improved so that the TIME() is captured as
quickly as possible in the first statement without losing the
value of &LASTNO.


There are n! ways of rearranging (or permuting) n objects and
these are referred to as permutations. For example, there are
3! (=6) ways of permuting the 3 characters of the string 'ABC'
as follows:


ABC 
ACB 
BAC 
BCA 
CAB 
CBA 


There is a body of literature on the subject of permutations 
[Algorithms, 1968, p. 829] owing, perhaps, more to the value 
of studying permutations as a computational exercise rather 
than for strictly utilitarian reasons. Yet, the study of 
techniques employed to solve this problem is undoubtedly use- 
ful in discovering techniques for solving more practical 
problems. 


Permutation routines are subject to a variety of different
ground rules. The object to be permuted may be an array, a
list or a string. The array may be an array of integers
(1,2,...,n) or an arbitrary array. The permutation may be
lexicographic; in the case of strings this would imply that
the permutations are produced in alphabetic order. In general,
if the objects to be permuted can be compared relative to each
other (well-ordered in mathematical parlance) a lexicographic
order is defined on the permutation, and some algorithms are
constrained to produce the permutations in this order.
Sometimes the objects to be permuted contain duplicates such
as the characters of 'MISSISSIPPI' and the permutation program
is required to produce only those permutations which are truly
distinct. These are sometimes known as "permutations with
repetitions" or, as we will call them, reorderings. Finally,
the permutation wanted may be a purely random one and the
algorithm for doing that is included in the section on
Stochastic Strings.


Permutation Records

We will speak in this section of permuting n+1 objects. This
may seem more awkward than speaking of permuting n objects but
it will have the advantage of making our notation simpler. The
number of permutations of n+1 objects is (n+1)! and the
reasoning is as follows. Assume that the objects are selected
one at a time in an arbitrary sequence to be placed in some
permutation. The first object drawn can be placed in only one
way. The second object drawn can be placed to the left or the
right of the first object; the 3rd object can be placed to the
left, between, or to the right of the previous 2 objects. In
general, the ith object can be placed in any of i different
positions and a little reflection will reveal that each
position will lead to a different permutation. Moreover, every
permutation can be obtained by this means. Hence, the total
number of permutations can be obtained by multiplying all
these combinations which yields the result (n+1)!.
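
The counting argument can be tried out directly. The following Python sketch (an illustration only; the function name is hypothetical) places objects one at a time into every available position and confirms that the arrangements obtained are all different and number (n+1)!.

```python
from math import factorial


def permutations_by_insertion(objects):
    """Build every arrangement by inserting each object in every position."""
    arrangements = [[]]
    for obj in objects:
        # The ith object drawn has i possible positions in each arrangement.
        arrangements = [
            arr[:pos] + [obj] + arr[pos:]
            for arr in arrangements
            for pos in range(len(arr) + 1)
        ]
    return arrangements


if __name__ == "__main__":
    result = permutations_by_insertion("ABCD")
    distinct = {tuple(arr) for arr in result}
    print(len(result), len(distinct), factorial(4))
```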


This reasoning leads naturally into the idea of a permutation
record which is important computationally, because most
algorithms depend on some form of this record to record past
history. Let

     i_1, i_2, ..., i_n

be a sequence of integers obeying the following inequalities

     0 <= i_1 <= 1
     0 <= i_2 <= 2
         ...
     0 <= i_n <= n

For example:

     1 0 2 4 2

is a permutation record for n = 5. A permutation record of
length n can be thought of as representing a permutation of
n+1 objects as follows: the first object is placed down. The
second object is placed to the left or right of the first
object depending on whether i_1 is a 0 or a 1. This process is
continued until the (n+1)st object is placed in the position
indicated by i_n.


For some applications it is convenient to speak of the "Ith
permutation" of n+1 objects where I ranges from 0 to (n+1)!-1.
The integer I can be related to a permutation record as
follows:

     I = i_1 + i_2(2!) + i_3(3!) + ... + i_n(n!)        (12.1)

Such an I will be called the permutation number of the given
record. The permutation record may be regarded as a
representation in the factorial number system of the
permutation number [Knuth, Vol.2, 175 and Pager, 1970]. For
example, let i_1 i_2 i_3 = 1 0 2. Then

     I = 1 + 0(2!) + 2(3!)
       = 1 + 0 + 12
       = 13


Thus every permutation record yields some permutation number.
But is that number unique, or will two different records lead
to the same number? We will show that not only is there a
unique record for each number but that the record is easily
reconstructed. First, note that 2 divides every term on the
right hand side of (12.1) except the first so that

     i_1 = REMDR(I,2)

To determine the remaining n-1 elements of the permutation
record, set I_1 = (I - i_1)/2 so that

     I_1 = i_2 + i_3(3!/2) + ... + i_n(n!/2)

In this equation, each term is divisible by 3 except the first
so that

     i_2 = REMDR(I_1,3)

This process of division and remaindering can be repeated
until all coefficients have been obtained. Hence, given a
number I, the permutation record can be deduced.
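
The conversion between a permutation number and its record, in both directions, can be sketched in Python as follows. The function names are hypothetical; the arithmetic is the division-and-remainder process just described and the sum in equation (12.1).

```python
from math import factorial


def record_to_number(record):
    """Equation (12.1): I = i_1 + i_2(2!) + ... + i_n(n!)."""
    return sum(digit * factorial(k) for k, digit in enumerate(record, start=1))


def number_to_record(number, n):
    """Recover i_1 .. i_n by repeated division with radix 2, 3, ..., n+1."""
    record = []
    for radix in range(2, n + 2):
        number, digit = divmod(number, radix)
        record.append(digit)
    if number:
        raise ValueError("number is too large for a record of length n")
    return record


if __name__ == "__main__":
    print(record_to_number([1, 0, 2]))   # the worked example in the text
    print(number_to_record(13, 3))
```

Because each step of number_to_record yields exactly one remainder, two different records cannot share a permutation number.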


Program 12.1  PERMUTATION

PERMUTATION(S,I) will return the Ith permutation of the string
S where I is a permutation number as defined above. If I is 0
then the permutation is equal to S itself. If I >= N! where
N = SIZE(S), then PERMUTATION will fail. Note that we can
obtain all permutations of a given string in this way provided
N!-1 <= the maximum integer. On the IBM 360, with a maximum
integer of 2^31 - 1, this amounts to the restriction that
N <= 12. This seems rather severe and Exercise 12.11 suggests
a remedy. Note that if one were cycling through each
permutation of a set of objects one would be better advised to
use a routine specially designed for that purpose (such as
PERM, Program 12.2).


*
*  PERMUTATION(S,I) will return the Ith permutation of the
*  string S.
*
          DEFINE('PERMUTATION(S,I)RADIX,T,S1,N')
+                                                :(PERMUTATION_END)
*
*  Entry point and top of loop: If I is 0 or drops to 0 as a
*  result of repeated division, return the value remaining in
*  S and the characters already accumulated in PERMUTATION.
*
PERMUTATION
          PERMUTATION = EQ(I,0) PERMUTATION S    :S(RETURN)
*
*  Otherwise remove the next character of S (calling it T)
*  and insert it into the position determined by the next
*  value (N) of the permutation record. If no T could be
*  found then fail because this means I was too big.
*
          S LEN(1) . T =                         :F(FRETURN)
          RADIX = RADIX + 1
          N = REMDR(I,RADIX)
          PERMUTATION RTAB(N) . S1 = S1 T
          I = I / RADIX                          :(PERMUTATION)
PERMUTATION_END


Epilogue 


Characters are inserted one at a time into the string 
PERMUTATION in a position depending on the value of the per- 
mutation record. The value indicates a number of characters 
from the right because in this way a 0 permutation and only a 
0 will result in an identity operation. 
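
For readers who wish to experiment away from SNOBOL4, the insertion scheme can be paraphrased in Python. This is a sketch of the same idea rather than a replacement for Program 12.1; returning None to stand for failure is an assumption of the sketch.

```python
def permutation(s, i):
    """Return the i-th permutation of s, or None when i is too large."""
    built = ""
    radix = 0
    while i != 0:
        if not s:
            return None                    # ran out of characters: i >= N!
        t, s = s[0], s[1:]                 # remove the next character of s
        radix += 1
        n = i % radix                      # next value of the record
        cut = len(built) - n               # n characters from the right
        built = built[:cut] + t + built[cut:]
        i //= radix
    return built + s


if __name__ == "__main__":
    for number in range(6):
        print(number, permutation("ABC", number))
```

Counting the insertion point from the right is what makes permutation number 0, and only 0, leave the string unchanged.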


PERMUTATION is not well suited for arrays (as it stands) 
because insertion of an object into an array (while neighbors 
are moved apart) is not a natural operation. Instead of in- 
terpreting each element of the permutation record as an inser- 
tion point, each value can be regarded as an interchange 
distance, as follows. Interchange A<2> and A<1> according to
the value of i_1. That is, interchange

     A<2> and A<2-i_1>

Then interchange A<3> with A<3-i_2>. Continue in this way
until A<n+1> and A<n+1-i_n> are interchanged.


Can all permutations be obtained in this way? By a bit of
backward reasoning we can conclude that they can. From the
position in the permuted array of the last element of the
original array one can determine the value of i_n. Hence the
scene as it existed prior to the last interchange can be
reconstructed. Continuing in this way, the entire permutation
record can be reconstructed. That means that every different
permutation record gives rise to a different permutation. But
there are (n+1)! permutation records and hence all
permutations must be obtainable.
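
The interchange interpretation is easy to try out. In the Python sketch below (hypothetical function name, 0-based list indices standing in for the 1-based array A) each element of the record is used as an interchange distance.

```python
def permute_by_interchange(items, record):
    """Apply a permutation record to a sequence by successive interchanges."""
    a = list(items)
    for k, distance in enumerate(record, start=1):
        # 1-based A<k+1> and A<k+1-i_k> are 0-based a[k] and a[k-distance].
        a[k], a[k - distance] = a[k - distance], a[k]
    return a


if __name__ == "__main__":
    print(permute_by_interchange("ABCD", (1, 0, 2)))
```

Running all 2 x 3 x 4 records of length 3 through this function yields 24 different arrangements of four objects, in agreement with the argument above.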


Program 12.2  PERM

Although the function PERMUTATION can yield a particular one
of a class of permutations, it is not particularly well suited
for cycling through all permutations of a given set of
elements. This is because each permutation is generated
freshly. It is more efficient to continually modify the last
permutation to obtain the next. Trotter [1962] produced a
scheme in which only one interchange per call was necessary to
obtain each permutation. His method is basically as follows.
Imagine the objects to be permuted to be arranged from left to
right and numbered from 1 to n. Interchange objects 1 and 2
to produce a new permutation. Then interchange objects 2 and
3, 3 and 4, etc. In this way the object which had been on the
left will swing in daisy chain fashion over to the right. When
it reaches the right side it stops, the n-1 objects to its
left are permuted once and, on subsequent calls, the last
element is daisy-chained back from right to left. When it
reaches the left, the other elements are again permuted and
the process repeats. One needs a permutation record of sorts
to record this movement and this is done as follows. I_1
contains the position of the 1st element among the other (n-1)
elements. I_2 holds the position of the 2nd element among the
other (n-2) elements, etc. (A separate array can hold +1 or -1
to denote direction of movement.) This system has the nice
property that most permutations are done by a single test,
increment, and interchange. The programming can be simplified
by the use of recursion (not originally given by Trotter)
without significantly adding to the time (see Exercise 12.12).


PERM(A) uses Trotter's algorithm to cycle through every
permutation of a singly dimensioned array with lower bound 1.
The first time PERM is called the array is not modified but
initialization is made. The initial value of A is regarded as
the first permutation. On subsequent calls, the argument to
PERM (presumably the same array) is permuted. Finally, when
no more permutations remain, PERM will fail and reset itself
to its initial state awaiting a new array.
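
The adjacent-interchange idea can be sketched in Python. Note that this is not the recursive formulation used by PERM: it is a common iterative variant (assumed here for brevity) in which every object carries a direction and the largest object able to move is interchanged with its neighbour. The generator name is hypothetical.

```python
def adjacent_interchange_permutations(items):
    """Yield every permutation, each differing from the last by one
    interchange of two neighbouring elements."""
    a = list(items)
    n = len(a)
    ident = list(range(n))        # ident[pos]: original index of the object at pos
    direction = [-1] * n          # direction of travel for each object
    yield tuple(a)
    while True:
        mobile = -1               # position of the largest object that can move
        for pos in range(n):
            target = pos + direction[ident[pos]]
            if 0 <= target < n and ident[target] < ident[pos]:
                if mobile == -1 or ident[pos] > ident[mobile]:
                    mobile = pos
        if mobile == -1:
            return                # nothing can move: all permutations produced
        moved = ident[mobile]
        target = mobile + direction[moved]
        a[mobile], a[target] = a[target], a[mobile]
        ident[mobile], ident[target] = ident[target], ident[mobile]
        for other in range(moved + 1, n):
            direction[other] = -direction[other]   # larger objects turn around
        yield tuple(a)


if __name__ == "__main__":
    for p in adjacent_interchange_permutations("ABC"):
        print("".join(p))
```

The scan for the mobile object makes each step of this sketch proportional to n; the location and direction arrays described above avoid that scan.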


*
*  PERM(A) will permute the elements of the array A, failing
*  when no more permutations remain. A is assumed to have at
*  least 2 elements.
*
          DEFINE('PERM(A)','PERM_INIT')          :(PERM_END)
*
*  PERM_INIT is the entry point on the first call to PERM.
*  First obtain the size of A (by converting prototype to
*  integer) and retain it for future reference in the global
*  variable SIZE_A.
*
PERM_INIT SIZE_A = +PROTOTYPE(A)
*
*  Set up arrays to indicate location and direction of
*  movement of elements. Initialize location arrays to 1
*  because every element starts in 1st position relative to
*  remaining members. Initialize direction array to 1 to
*  indicate rightward movement. -1 indicates leftward
*  movement.
*
          LOC_ELEMENT = ARRAY('0:' SIZE_A - 2, 1)
          DIR_ELEMENT = ARRAY('0:' SIZE_A - 2, 1)
*
*  Redefine the entry point. All outside calls will have one
*  argument so that I and OFFSET will initially have the
*  value null. When PERM is called recursively I and OFFSET
*  are given different values. I represents the item to be
*  permuted and OFFSET represents the extent to which the
*  subpermutation of elements I, I + 1, ..., N - 1 is offset
*  from the overall permutation.
*
          DEFINE('PERM(A,I,OFFSET)RL,D,LIMIT,AL')  :(RETURN)
*
*  Steady state entry point: Determine the relative location
*  (RL) of the Ith element in the subarray and the direction
*  (D) in which it is moving. Also determine the LIMIT of
*  travel in this direction. If the limit has been reached,
*  go to PERM_1.
*
PERM      RL = LOC_ELEMENT<I>                    :F(FRETURN)
          D = DIR_ELEMENT<I>
          LIMIT = EQ(D,1) SIZE_A - I
          LIMIT = EQ(D,-1) 1
          EQ(LIMIT,RL)                           :S(PERM_1)
*
*  Determine the absolute location (AL) of the Ith element,
*  swap elements, update location vector, and return.
*
          AL = RL + OFFSET
          SWAP(.A<AL>, .A<AL + D>)
          LOC_ELEMENT<I> = RL + D                :(RETURN)
*
*  Reverse the direction of movement of the Ith element.
*  Determine the OFFSET of the subpermutation and attempt to
*  make the permutation; if success return; otherwise, reset
*  entry point and fail.
*
PERM_1    DIR_ELEMENT<I> = -D
          OFFSET = EQ(D,1) OFFSET + 1
          PERM(A, I + 1, OFFSET)                 :S(RETURN)
PERM_F    DEFINE('PERM(A)','PERM_INIT')          :(FRETURN)
PERM_END

Names referenced    Name       Type        Where defined
by PERM:            SWAP       Function    Program 3.14

Epilogue 


The program is written recursively because this is the way the 
algorithm is described, and because the inefficiencies of 
recursion will not manifest themselves in substantially slower 
programs. A difficulty involved in specifying the function
recursively was that the recursive call is to permute an array
which does not exist in isolation but only as part of a larger
array. Hence, we must give additional information such as the
OFFSET of the start of the array with respect to the larger
array and I, the level of the item to be moved. The OFFSET
and level have been defined in such a way that the outer call
should be made with these values equal to 0. Hence if the user
ignores them, which he is instructed to do, and passes only
one argument, the array, he will get the correct results.


Program 12.3  PERMS

Although PERM can be modified to permute strings, we here seek
an algorithm specifically intended for use with the string
data type in hopes of obtaining something simpler if not more
efficient. As we recall from Chapter 3, a permutation can be
regarded as a positional transformation and hence can be
programmed to run rapidly via the REPLACE function. Thus if
P(S) is a permutation of the string S and if X is the first n
characters from &ALPHABET where n is the size of S, then

     REPLACE(P(X), X, S)

will be equal to P(S). The difficulty, it would seem, is that
in order to obtain P(S) we need to construct the permutation
first. But this difficulty can be surmounted by the following
consideration. Let

     S_1 = REPLACE(P(X), X, S)
     S_2 = REPLACE(P(X), X, S_1)
     S_3 = REPLACE(P(X), X, S_2)

etc. Each consecutive permutation is obtained by permuting
according to P the last previously obtained permutation. It
is customary to denote the compounding of permutations in this
way by product notation and the repeated application of the
same permutation therefore is denoted by exponential notation
as:

     S_1 = P(S)
     S_2 = PP(S) = P^2(S)
     S_3 = P^3(S)

etc. One interesting question is: does there exist a
permutation P for which its various powers cycle through all
the permutations? This question is answered by group theory.
The set of permutations of n objects can be regarded as the
elements of a group (of cardinality n!) where the group
operation is the "multiplication" described above. The
question becomes, is the permutation group of n elements
cyclic? For n greater than 2 the answer is readily given as no
(see, for example, Zassenhaus [1958]), but we can produce
almost as good a result by obtaining a small set of basic
permutations, from which we can produce all the others.


In what follows we will speak of rotating the first k charac- 
ters of a string one place or simply rotating the first k 
characters to mean the transformation: 


          S LEN(1) . C LEN(K - 1) . S1 = S1 C


In words, the first k characters are picked up, rotated once 
to the left and set down again. Thus, rotating the first 3 
characters of 'ROTATE' yields 'OTRATE'. Rotating the first k 
characters of a string is a positional transformation and can 
be done at high speed provided appropriate REPLACE arguments 
have been set up in advance. Let R(k) denote the operation of 
rotating the first k characters of a string. Then R(n) will 
rotate all the characters, and R(1) will do nothing. All per- 
mutations of a string can be obtained by a suitable combina- 
tion of R(i)'s as follows. 


To produce the first permutation apply R(n). To obtain the
2nd apply R(n) again. Upon applying R(n) for the nth time, we
will have produced the original string which of course we
cannot return. At this point we apply R(n-1) and return the
resulting string. On subsequent calls R(n) is applied until
the nth time thereafter at which point R(n-1) is again
applied. Upon n-1 repetitions of this sequence of events we
will have returned to the starting point at which time we
apply R(n-2). So the sequence continues until, at last, there
emerges an attempt to apply R(1). R(1) is a 'no-op' and this
is the signal that all permutations have been produced. A
permutation record is used to record the number of
applications of each type of rotation.
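
The rotation scheme can be tried out with the following Python sketch. It is an illustration of the sequence of rotations just described, with string slicing standing in for the precomputed REPLACE arguments; the generator name and the counts list are hypothetical.

```python
def rotation_permutations(s):
    """Yield every permutation of s (distinct characters assumed) by
    rotating leading substrings one place to the left."""
    n = len(s)
    counts = [0] * (n + 1)        # counts[k]: number of applications of R(k)
    yield s
    while True:
        k = n
        while True:
            if k < 2:
                return            # an attempt to apply R(1): we are finished
            s = s[1:k] + s[0] + s[k:]        # apply R(k)
            counts[k] += 1
            if counts[k] % k != 0:
                break             # not yet full cycle: this is a new string
            k -= 1                # full cycle: fall back to a shorter rotation
        yield s


if __name__ == "__main__":
    print(list(rotation_permutations("ABC")))
```

For 'ABC' the strings appear in the order ABC, BCA, CAB, BAC, ACB, CBA.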


The idea of obtaining the sequence of  permutations by a 
suitable number of rotations was suggested by Peck and Schrack 
[1962] and suffered from the fact that Trotter's algorithm 
(which appeared later) produced a superior result for arrays. 
But in the case of strings, rotations can be programmed to be 
as efficient as interchanges. Since the computational backdrop 
is simpler for the Peck and Schrack algorithm we will use it 
to write PERMS. We have come full cycle on this one. 


PERMS (S) will permute the characters of the string S. S | 
is assumed to be at least 2 characters long and no greater | 
than the size of &ALPHABET. The argument S should be the | 
string which had been returned by PERMS on the last call. | 
When no more permutations remain, PERMS will fail. | 
A O A E A A 


DEFINE ('PERMS (S) T,N,C,K',*PERMS_INIT*) : (PERMS_END) 


E A E E EEES 
| Initialization entry point: N_R<I> will record the number | 
| of applications of R(I). FIRST_OP is an array such that | 
| REPLACE( FIRST_OP<I>, SECOND_OP, S) will be equivalent to | 
| applying R(I) to S. | 
A A ——————————————————————— ——— — | 
PERMS_INIT 


N = SIZE(S) 
NR = ARRAY('2:' N, 0) 
ALPHABET LEN (N) . SECOND CP : F (ERROR) 
FIRST OP = ARRAY('2:' N, SECOND OP) 
K = N#1 
PERMS I1 K = K- 1 
FIRST_OP<K> LEN(1) . S1 TAB(K) . S2 = S2 S1 
+ :S (PERMS. I 1) 
DEFINE('PERMS (S) I,K') 
PERMS = S : (RETURN) 


*
* Steady state entry point: Initialize K to the size of the
* string.
*
PERMS    K = SIZE(S)
*
* Apply R(K); failure implies that K=1 in which case we
* branch to PERMS_2.
*
PERMS_1  S = REPLACE(FIRST_OP<K>, SECOND_OP, S)        :F(PERMS_2)
*
* Bump N_R<K>; if this number equals 0 mod K we have come
* full cycle; decrement K and repeat.  Otherwise return S.
*
         N_R<K> = N_R<K> + 1
         K = EQ(REMDR(N_R<K>, K), 0) K - 1             :S(PERMS_1)
         PERMS = S                                     :(RETURN)
*
* If K is 1 no more permutations remain.  Fail but ready
* PERMS for next set of permutations.
*
PERMS_2  DEFINE('PERMS(S)T,N,S1,S2','PERMS_INIT')      :(FRETURN)
PERMS_END
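
The control flow of PERMS can also be followed outside SNOBOL4. The Python generator below is only an illustrative sketch, not the book's code: the name rotation_perms and the list counts (standing in for N_R) are assumptions, and string slicing takes the place of the REPLACE-based rotation R(K).

```python
def rotation_perms(s):
    """Yield the permutations of s by rotating prefixes, as PERMS does.

    counts[k] plays the role of N_R<K>: the number of times the
    rotation R(k) of the first k characters has been applied.
    """
    n = len(s)
    counts = [0] * (n + 1)
    yield s                                   # the initializing call returns S itself
    while True:
        k = n
        while k > 1:
            s = s[1:k] + s[0] + s[k:]         # R(k): rotate the first k characters left
            counts[k] += 1
            if counts[k] % k != 0:            # not yet full cycle at this level
                break
            k -= 1                            # full cycle: drop to a shorter prefix
        else:
            return                            # k reached 1: no permutations remain
        yield s


if __name__ == "__main__":
    print(list(rotation_perms("ABC")))
```

For a 3-character string the sketch visits each of the 6 arrangements once before stopping, which mirrors PERMS failing after the last permutation.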


Program 12.4  REORDER

We define a reordering of a string S as a permutation which
produces a new string. For example, the string 'AAB' has 6
permutations but only 3 are distinct (determined by the
position of 'B') and so has only 3 reorderings. Reorderings
are usually more significant than permutations in string
processing where repeated elements are more common than in,
say, arrays of numbers.


REORDER(S,OS) will produce a reordering of the characters of 
the string S where OS is an ordered version of the string S. 
REORDER can be used to cycle through every different string 
composed of the characters of a given string, starting with 
the ordered string OS. It will FAIL when no more strings 
remain. Thus, using Program 3.1, ORDER, to order the string S 
we can print every reordering of S by the statements 


         OS = ORDER(S)
         OUTPUT = OS
LOOP     OUTPUT = REORDER(OUTPUT, OS)                  :S(LOOP)


Note that in the above, the previously generated string is 
used as the next input. 


It so happens that ORDER(S) will place the characters of S in
alphabetic order. It is not necessary to be so strict. In
fact, all that is necessary is that the ordered string contain
like characters in adjacent positions. Thus if the string is
'MISSISSIPPI', then 'SSSSIIIIPPM' will be a suitably ordered
version.


The number of reorderings of a string can be substantially
less than the number of permutations. Let N be the length of
a string S having n different characters. Let there be $k_1$
instances of the first character, $k_2$ instances of the second,
etc. Then the number of reorderings is

$$\frac{N!}{k_1!\,k_2!\cdots k_n!}$$

For 'MISSISSIPPI' the number of reorderings is

$$\frac{11!}{4!\,4!\,2!} = 34650$$

It would take about 48 pages to print all the reorderings of
'MISSISSIPPI'. To print the permutations would require about
50,000 pages.
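
As a quick check of the count above, the following Python sketch evaluates the quotient of factorials for any string. The function name count_reorderings is an assumption made for this illustration.

```python
from collections import Counter
from math import factorial


def count_reorderings(s):
    """Number of distinct strings that use exactly the characters of s."""
    total = factorial(len(s))
    for k in Counter(s).values():     # k = number of instances of one character
        total //= factorial(k)        # each intermediate quotient is an integer
    return total


if __name__ == "__main__":
    print(count_reorderings("MISSISSIPPI"))
```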


* REORDER(S,OS) is used to produce the next permutation
* (with repetitions) of the string S.  OS is an ordered
* version of the string S.  It is called recursively.
*
         DEFINE('REORDER(S,ORDERED_S)C,FRONT,S1,LAST,D,OS')
+                                                      :(REORDER_END)
*
* Entry Point: Obtain in C the last character of ORDERED_S.
* If no such character exists, S must be the null string.
* Since this has no reordering, we fail.
*
REORDER  ORDERED_S RTAB(1) LEN(1) . C                  :F(FRETURN)
*
* Then work any character of type C toward the front of S.
* First remove the characters of type C (if any) that
* already are at the front of S.
*
         S (SPAN(C) | NULL) . FRONT =
*
* Look for an interior C and interchange it with its
* predecessor, grouping in with C all the characters
* obtained previously in FRONT.  If an interior C cannot be
* found, go to REORDER_1.
*
         S ARB . S1 LEN(1) . D C =                     :F(REORDER_1)
         REORDER = S1 FRONT C D S                      :(RETURN)
*
* If all characters of type C have been worked toward the
* front, control flows to REORDER_1.  Here we recursively
* obtain a new sub-ordering and put all the characters of
* type C on the back end.
*
REORDER_1 ORDERED_S BREAK(C) . OS
         REORDER = REORDER(S,OS) FRONT                 :S(RETURN)F(FRETURN)
REORDER_END


Epilogue 


We normally make concessions to the aim of providing the
simplest possible calling sequence, feeling that simplicity and
convenience are two of the most desirable qualities that a
program can have. Strictly speaking, the second argument to
REORDER is unnecessary inasmuch as it can be reconstructed
unambiguously from the first. But in the interest of avoiding
gross inefficiencies the second argument is made mandatory.
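
The stepping rule used by REORDER may be easier to follow in a sketch. The Python below is an illustrative translation made under stated assumptions, not the book's code: the names reorder and all_reorderings are invented here, None stands in for SNOBOL4 failure, and string slicing replaces the pattern matches.

```python
def reorder(s, ordered):
    """Return the reordering that follows s, or None when none remains."""
    if not ordered:
        return None                       # null string: nothing to reorder
    c = ordered[-1]                       # last character of the ordered string
    stripped = s.lstrip(c)                # drop the run of c already at the front
    front = s[:len(s) - len(stripped)]
    j = stripped.find(c)                  # first interior c, if any
    if j > 0:
        # interchange that c with its predecessor, carrying the front run along
        return stripped[:j - 1] + front + c + stripped[j - 1] + stripped[j + 1:]
    # every c is at the front: reorder the rest, then put the c's on the back end
    rest = reorder(stripped, ordered[:ordered.index(c)])
    return None if rest is None else rest + front


def all_reorderings(ordered):
    """Collect every reordering, starting from the ordered string itself."""
    result = [ordered]
    nxt = reorder(ordered, ordered)
    while nxt is not None:
        result.append(nxt)
        nxt = reorder(nxt, ordered)
    return result


if __name__ == "__main__":
    print(all_reorderings("AAB"))
```

As in the SNOBOL4 loop shown earlier, each returned string is fed back in as the next first argument while the ordered string stays fixed.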


Program 12.5  LPERM

As we have stated earlier, some applications require
permutations to be lexically ordered. This added restriction
complicates the problem of permuting slightly; several
solutions have been proposed. One by Shen [1963] has been
found [Ord-Smith 1967] to be the "best and fastest" of a
number of lexical permutation algorithms. It operates as
follows. Obviously the first permutation is the string in
lowest alphabetical order, i.e. the one produced by ORDER.
The next permutation is obtained by interchanging the last 2
characters. It is also clear that the last permutation will be
the one in reversed lexical ordering as shown below:


         ABCDEF
         ABCDFE
           .
           .
           .
         FEDCBA


To obtain the next higher lexical ordering we find the
smallest sized suffix that can be increased lexically. This is
done by scanning from right to left looking for a character 
smaller than the previous character. This we call the pivotal 
character. All characters to its left must remain unchanged. 
The character moved in (from the right) to take the place of 
the pivotal character must be the next higher character to the 
right of the pivotal character. This is called the replacement 
character. All other characters in the suffix must be placed 
into the lowest lexical state. This is most easily done by 
interchanging the pivotal character with its replacement and 
reversing all characters other than the replacement. An exam- 
ple of this operation is shown in Figure 12.1. 
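
The pivot-and-replacement step described above can be sketched in Python. This is a generic illustration of the step, not the SNOBOL4 formulation used by LPERM: the function name next_lexical is an assumption, and None stands in for failure.

```python
def next_lexical(s):
    """Return the next string in lexical order, or None if s is the highest."""
    chars = list(s)
    # Scan from right to left for the pivot: a character smaller than
    # the character immediately to its right.
    i = len(chars) - 2
    while i >= 0 and chars[i] >= chars[i + 1]:
        i -= 1
    if i < 0:
        return None                       # already in highest lexical state
    # The suffix is in descending order, so the rightmost character greater
    # than the pivot is the next higher one: the replacement.
    j = len(chars) - 1
    while chars[j] <= chars[i]:
        j -= 1
    chars[i], chars[j] = chars[j], chars[i]
    # Reversing the (still descending) suffix puts it in its lowest state.
    chars[i + 1:] = reversed(chars[i + 1:])
    return "".join(chars)


if __name__ == "__main__":
    print(next_lexical("ABCDEF"))
```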


LPERM(S) will return the reordering of S next higher in lex- 
ical order. It uses the Shen algorithm modified for SNOBOL4. 
If no lexically greater permutation exists for S, LPERM will 
fail. To obtain all reorderings of a string the previously
returned string must be passed as argument; the initial
argument must equal ORDER(S).


Figure 12.1  An example illustrating the method used by LPERM
to obtain the next permutation in lexical order (the diagram
marks the pivot and replacement characters).

* LPERM(S) returns the next reordering in lexicographic
* order of the string S.
*
         DEFINE('LPERM(S)P,T,X,R,Y,HIGHS')
*
* Find the alphabetically highest character.
*
         &ALPHABET RTAB(1) LEN(1) . HIGH_CHAR
+                                                      :(LPERM_END)
*
* Entry point: Reverse the string to make scanning from the
* back end easier.  Also place dummy character onto end so
* that unevaluated expressions work.
*
LPERM    S = REVERSE(S) HIGH_CHAR
*
* Look for pivot character (P).  If none can be found the
* argument was in its highest lexical state.  We therefore
* fail.
*
         S LEN(1) $ T LEN(1) $ P *LGT(T,P)             :F(FRETURN)
*
* Search &ALPHABET for the set of all characters > P.  Call
* them HIGHS.  Then search S for the replacement character
* (R).
*
         &ALPHABET BREAK(P) LEN(1) REM . HIGHS
         S BREAK(HIGHS) . X LEN(1) . R BREAK(P) . Y LEN(1)
+                  = REVERSE(X P Y) R
*
* Reverse the entire string back, remove the dummy character
* and return.
*
         LPERM = REVERSE(S)
         LPERM HIGH_CHAR =                             :(RETURN)
LPERM_END

Names referenced by LPERM:

         Name       Type        Where defined
         REVERSE    Function    Program 3.6

Epilogue


The single most interesting part of LPERM, from the
implementation point of view, is the search for the pivot
element. Here a search is made for 2 consecutive characters
such that the first is lexically greater than the second. This
is done using dynamic assignment (the binary $ operator) and an
unevaluated expression (*LGT(T,P)). To make this work under the
normal quick-scan mode, a character had to be appended to S.
This is because the scanner assumes that *LGT will match at
least one character (which it does not) and would prematurely
fail without testing if no more characters remained. The
character appended (viz. HIGH_CHAR) was chosen in such a way
that the algorithm will work whether or not the one-character
assumption is made.


Program 12.6  IP

A permutation vector is a sequence $i_1, i_2, \ldots, i_n$
containing one each of the numbers 1, 2, ..., n. If P is a
permutation vector (in the form of an array) then AI(A,P),
where AI is Prog. 4.6, will return an array in which the
elements of A have been permuted according to P. That is, the
element in position P<i> will be moved to position i. Let


B = AI(A,P) 


If P is a permutation vector there must be another permutation 
vector Q such that A = AI(B,Q). Q is called the inverse of P. 
One description of Q is as follows 


         Q<j> = k if and only if P<k> = j

This suggests that Q can be created as follows

         Q = COPY(P)
         SEQ(' Q<P<K>> = K ', .K)


(SEQ is defined in Prog. 4.3). For very large arrays we may 
find that it is necessary, or at least highly desirable, to 
invert the permutation vector in place and thus avoid the 
creation of additional storage. One way to do this is to 
recognize that every permutation consists of a sequence of cy- 
cles. Thus, the permutation vector (5,3,1,6,2,4,7) will have 
cycles as indicated in Figure 12.2. 
Figure 12.2  The permutation vector (5,3,1,6,2,4,7), with boxes
<1> through <7> holding the values 5, 3, 1, 6, 2, 4, 7 and an
arrow drawn from each box i to box P<i>.

Figure 12.2 is drawn by directing an arrow from box i to box 
P<i>. For example P<1> is 5 so that an arrow is drawn from 
the first box to the fifth. A permutation vector has the 
property that each box will have exactly one such arrow direc- 
ted in and one directed out. From this it follows that each 
arrow will form part of a closed loop and that the entire 
graph is a collection of non-intersecting closed loops. Thus, 
permutations can be completely characterized by their loops. 
The vector of Figure 12.2, for example, can be described as: 


         (5,2,3,1) (6,4) (7)
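
The loop structure can be extracted mechanically by following the arrows from each box that has not yet been visited. The Python sketch below is an illustration only; the function name cycles and the 1-based values held in a 0-based list are assumptions of this example. It lists each loop starting from its lowest-numbered box, so the first loop appears as (1,5,2,3), which is the same loop as (5,2,3,1) written from a different starting point.

```python
def cycles(p):
    """Return the closed loops of permutation vector p (values are 1-based)."""
    n = len(p)
    seen = [False] * (n + 1)
    result = []
    for start in range(1, n + 1):
        if seen[start]:
            continue                      # box already belongs to an earlier loop
        loop = []
        i = start
        while not seen[i]:
            seen[i] = True
            loop.append(i)
            i = p[i - 1]                  # follow the arrow from box i to box P<i>
        result.append(loop)
    return result


if __name__ == "__main__":
    print(cycles([5, 3, 1, 6, 2, 4, 7]))
```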


The inverse permutation can be obtained by reversing all ar- 
rows. This is most conveniently done by reversing all the 
arrows in a given loop much in the manner used to reverse a 
list  (REVL, Prog. 5.3). When elements in a given loop are 
reversed they are made negative to indicate their reversal. 


* IP(P) will invert a permutation vector contained in the
* array P.  No additional storage is consumed.
*
         DEFINE('IP(P)M,PM,K,PK,PPK')                  :(IP_END)
*
* Entry point and outer loop: Bump M by 1 looking for a
* non-negative value in P<M>.  Such a value indicates the
* start of a cycle.  Array elements already inverted are
* denoted by negative values.  When M runs out, we are done.
*
IP       M = M + 1
         IP = ~P<M> P                                  :S(RETURN)
         P<M> = LT(P<M>,0) -P<M>                       :S(IP)
*
* If PM = M then we have a trivial cycle.  Go back.
* Otherwise, we let K sequence through the cycle starting
* at M.
*
         EQ(P<M>,M)                                    :S(IP)
         K = M ; PK = P<M>
*
* Go through loop setting P<P<K>> = -K.  Care must be taken
* to save the value of P<P<K>> before it is overwritten.  The
* loop terminates when we arrive back at M.
*
IP_LOOP  PPK = P<PK>
         P<PK> = -K
         K = PK
         PK = PPK
         EQ(PK,M)                                      :F(IP_LOOP)
         P<PK> = K                                     :(IP)
IP_END

Epilogue


IP has been adapted for SNOBOL4 from an algorithm by Medlock
[1965] and Boonstra [1965]. See also Knuth [Vol.1, 175] for
another inverse permutation algorithm.
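
The negative-marking scheme of IP can be traced in a short Python sketch. It is an illustration under stated assumptions rather than the book's code: the name invert_in_place is invented here, and the 1-based permutation values are held in an ordinary 0-based list.

```python
def invert_in_place(p):
    """Invert permutation vector p (1-based values) without a second array."""
    n = len(p)
    for m in range(1, n + 1):
        if p[m - 1] < 0:
            p[m - 1] = -p[m - 1]          # already inverted: just restore the sign
            continue
        if p[m - 1] == m:
            continue                      # trivial cycle
        k, pk = m, p[m - 1]
        while True:
            ppk = p[pk - 1]               # save the value about to be overwritten
            p[pk - 1] = -k                # reversed arrow, marked negative
            k, pk = pk, ppk
            if pk == m:
                break                     # arrived back at the start of the cycle
        p[pk - 1] = k                     # the starting box is stored positive
    return p


if __name__ == "__main__":
    print(invert_in_place([5, 3, 1, 6, 2, 4, 7]))
```

Applying the function twice returns the original vector, since the inverse of the inverse is the permutation itself.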
Exercise 12.1  Give the permutation numbers for the records
below (provided they are valid permutation records).

         a) (0 1 2 1)
         b) (1 2 1 0)
         c) (0 1 2 3)
         d) (1 3 2 4)
         e) (0 0 0 1)

Exercise 12.2  Compute the permutation record of the following
permutation numbers: (a) 6, (b) 3, (c) 13, (d) 26.


Exercise 12.3  Write a SNOBOL4 program to convert a
permutation record in V to a permutation number I. Assume the
record is a string containing numbers separated by commas as
in '1,2,1,3,'.

Exercise 12.4  Define the sum of 2 permutation records as the
permutation record of the sum of the associated permutation
numbers. Write a SNOBOL4 program to determine the sum of 2
such records. Assume the records are in the form indicated by
the previous exercise.

Exercise 12.5  Prove that the permutation number of
(1,2,3,...,n-1) is n!-1.

Exercise 12.6  The permutation number can alternatively be
defined as

$$I = i_1\,(n!/1!) + i_2\,(n!/2!) + \cdots + i_n\,(n!/n!)$$

Devise an algorithm to extract the record given I.


Exercise 12.7  On the first time through the loop of
PERMUTATION what will be the values assigned to RADIX, N, S1
and I?

Exercise 12.8  What is the associated permutation record of I
and what value is returned by PERMUTATION('ABC', I) as I
ranges from 0 through 5?

Exercise 12.9  Let S be a string of 6 characters. Obtain the
reverse of S by a call to PERMUTATION.

Exercise 12.10  Rewrite PERMUTATION to operate on arrays.


Exercise 12.11  In the call to PERMUTATION, one may escape the
problem of limited arithmetic precision by denoting the
permutation number as one long string as in

         PERMUTATION(S, '32564117246785')

Assuming that the length of a string is no greater than the
largest integer, what statements within PERMUTATION would have
to be modified to permit these extended integers? Modify them!


Exercise 12.12  Let C(n) be the average number of calls to
PERM (both external and internal) per permutation of an array
of n elements. For example, if PERM were non-recursive, C(n)
would be 1.

(a) Write an expression for C(n) in terms of C(n-1).

(b) Assuming that C(1) = 1, use (a) to compute C(2), C(3) and
C(4).

(c) Prove that if C(n) < C(n-1) then C(n+1) < C(n).

(d) On the basis of (a), (b) and (c) what value does C(n)
approach as n approaches infinity?

(e) What conclusions can you draw with respect to the use of
recursion to program PERM?


Exercise 12.13  PERM can be extended to handle the special
case of arrays of length 1 by the insertion of a single
instruction. What is the instruction and where should it be
placed?

Exercise 12.14  What error in PERM will arise if its argument
is an array with only one element?

Exercise 12.15  PERM may be modified to permute a global
string (say G_S) rather than an array by changing only two
statements (in addition to perhaps adding temporary
variables). What are they? Suggest modifications.

Exercise 12.16  Modify PERMS so that if it is called with the
null string it will be reset.

Exercise 12.17  In using PERMS to permute the string 'LEMON',
let us denote 'LEMON' itself as the 0th permutation. The next
value returned is called the first permutation, etc. What
number permutation is (a) 'MELON' and (b) 'EMLON'?


Exercise 12.18  Give the smallest sequence of k-rotations
(denoted R(k)) to permute the characters 'LEMON' to 'MELON'.

Exercise 12.19  How can REORDER be modified so that it
requires only one argument? Assume that the first string given
is in alphabetic order (as returned from the ORDER function).

Exercise 12.20  Write a function REORDERING(S,I) which will
return the Ith reordering of the string S. That is
REORDERING(S,0) will return ORDER(S), etc. Pattern the
function after PERMUTATION(S,I). Do not merely call REORDER I
times as this would be grossly inefficient. Hint: the number
of ways of interspersing k identical characters into the n+1
positions of a string of length n is given by the binomial
coefficient:

$$\binom{n+k}{k} = \frac{(n+k)!}{n!\,k!}$$

Exercise 12.21  Will the function LPERM (Prog. 12.5) produce
all permutations or all reorderings of a string with repeated
characters? Why?

Exercise 12.22  Permutation vectors may be regarded as
elements of a group under what operation?

Exercise 12.23  Let I be the identity permutation of n
elements. That is I = (1, 2, ..., n). Let P be an arbitrary
permutation vector and Q be its inverse. What is the value of
(a) AI(P,I), (b) AI(I,P), (c) IP(I), and (d) AI(P,Q)?
Sorting on a digital computer covers a wealth of applications,
can involve a variety of data structures and devices, and has
been met with a host of techniques. Sorting has been widely
used in business applications where payrolls, accounts,
inventories and lists of all kinds must be sorted by name,
number, address, etc. But, in addition, many other data
processing applications find a need for sorting. Examples
include compiler writing, where symbols are sorted in
alphabetic order; computational linguistics, where
dictionaries, indexes and concordances are prepared; and
systems programming, where libraries are alphabetized for
rapid searching. When the items to be sorted can fit entirely
in core storage, the process is called internal sorting. When
secondary storage is required, it is called external sorting.
This chapter is concerned with internal sorting methods only.
External sorting is generally only done when the amount of
data to be sorted is large. Under these circumstances,
SNOBOL4 is not the ideal language for efficiency reasons.


The aggregate of things to be sorted internally may be an
array, a list, a string, a tree or a table. The ordering may 
be on the basis of numerical value, lexicographic value or 
number of occurrences and the ordering may be forward or 
reverse. A routine may be required to actually sort an array 
or merely return an array of indices that could then be ap- 
plied to one or more arrays. For these reasons and others to 
follow there is no one universal sort routine. Rather, each 
situation tends to be special and tends to require a sort 
tailored for the application. 


The distribution of the input items may not be very uniform. 
There may, in fact, be strong correlations present in the to- 
be-sorted aggregate which, if taken into account, could im- 
prove the sorting time. Not all algorithms are equally adept 
at taking advantage of an almost-ordered input array. With 
some algorithms, almost-ordered data can actually adversely
affect sorting time. 


Another factor associated with the distribution which can in- 
fluence the choice of sorting algorithm is the degree to which 
there is repetition in the data to be sorted. For example, in 
the preparation of a book index or a word concordance, the 
number of repeated items is high. There are sorting techniques 
which work quite well in such circumstances and their use can 
reduce sorting times substantially for this kind of problem. 


The sorting situation is somewhat influenced by the nature and 
amount of so-called passive information which must undergo the 
same permutation as the input array, but which does not par- 
ticipate in the determination of the new order. For example, 
if we are sorting the payroll by location we presumably want 
to bring along with the location other passive information 
such as name, payroll number, salary, etc. Such ancillary
information may take many forms. The passive information may
appear in a separate array. Or the active information may be
embedded in the passive information as for example when card- 
image strings are to be sorted on the basis of certain 
columns. Or the passive and active information may appear as 
fields of programmer-defined data objects. The way in which a 
sorting method handles equal items may be crucial in certain 
applications where passive information is present. 


The reason that sorting is done at all is usually to 
facilitate later lookup by either man or machine. Imagine the 
difficulty one would have if all the names in the telephone 
book were scrambled chaotically. To search the telephone book 
for an entry we would have to make what is called a linear 
search comparing each name one after the other until the 
desired entry was found. The time required would be, on the 
average, the time to make n/2 comparisons, where n is the num- 
ber of items in the book. On the other hand, if the book is 
alphabetized we can do a so-called binary search. We can look 
at the middle item and decide whether the desired name occurs 
after or before this middle item. Regardless of the outcome 
of this initial test, we can again probe the middle element in 
the segment known to contain the name and, in such a way,
narrow the search by half at each comparison. The number of
comparisons in this latter case is $\log_2 n$. When n is large the
difference between $\log_2 n$ and n/2 is truly impressive. For n
equal to 10000, $\log_2 n$ is only 13 whereas n/2 is 5000.
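
The contrast between the two kinds of lookup can be made concrete by counting probes. The Python sketch below is an illustration only; the function names linear_probes and binary_probes are assumptions, and each probe is counted as one three-way comparison against the target.

```python
def linear_probes(items, target):
    """Count the items examined by a linear search for target."""
    for count, item in enumerate(items, start=1):
        if item == target:
            return count
    return len(items)


def binary_probes(items, target):
    """Count the middle elements examined by a binary search of sorted items."""
    lo, hi, probes = 0, len(items) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if items[mid] == target:
            return probes
        if items[mid] < target:
            lo = mid + 1                  # target lies after the middle item
        else:
            hi = mid - 1                  # target lies before the middle item
    return probes


if __name__ == "__main__":
    book = list(range(10000))
    print(linear_probes(book, 4999), binary_probes(book, 4999))
```

With 10000 sorted items the linear count for a typical entry runs into the thousands, while the binary count stays on the order of $\log_2 n$.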


An appreciation of the difference between a quantity which 
grows linearly (such as n/2) and a quantity which grows 
logarithmically is needed to understand the significance of 
some sorting methods and some formulas expressing their com- 
putational requirements. To further underscore the distinction 
between linear and logarithmic growth, the latter quantity 
grows only as fast as the number of digits needed to express 
the former. Thus $\log_2 n$ not merely grows more slowly than n
but becomes extremely sluggish as n grows large.


As we have outlined here, there is a rich variety in the kinds 
of sorts that one might be called upon to make. We will not 
try to give a complete and exhaustive set of programs which 
could handle every conceivable situation. We will, rather, 
present a few general methods, and give a few specific
examples and hope that either these, or suitable modifications of
them, will serve any given sorting need.


More complete sources of information on sorting are available. 
Flores [1969] and Knuth [Vol. 3] have written books on the
subject. An entire CACM issue has been devoted to sorting
[Sorting Issue, 1963]. An excellent early summary of sorting
techniques is given by Friend [1956]. A recent bibliography
is given in Lorin [1971].


Sorting methods generally subdivide into two categories,
internal and external. The internal sorts are subdivided again
into two categories, comparison sorts and distributive sorts.
Generally speaking, comparison sorts sort on the basis of
pairwise comparisons between elements. Distributive sorts are
anything else.


COMPARISON SORTS

A comparison sort works by successively comparing pairs of
items to be sorted. The values of the items are irrelevant
other than as to how they compare with each other. Thus, a
comparison sort will operate in precisely the same way if one
is sorting strings or numerical values. Indeed, a comparison
sort can be used effectively to sort data objects of any kind
provided an operation can be written which compares the two
items.


Before considering the various methods of sorting it will be 
well to obtain some idea of the basic computational neces- 
sities involved in a comparison sort. If we assume that every 
permutation of the input array is equally likely, then we can 
use an information-theory argument to determine a lower bound 
on the average number of comparisons needed. There are n! ways 
of permuting n objects. Therefore the input array (of length 
n) can be thought of as encoding a message containing $\log_2 n!$
bits. Since one comparison yields one bit of information and
since in order to sort we need complete information concerning
the permutation, we may loosely conclude that at least $\log_2 n!$
comparisons are needed on the average. Using Stirling's
approximation formula [Knuth, Vol.1, p.46] we obtain

$$\begin{aligned}
\log_2 n! &\approx \log_2\!\left((2\pi)^{.5}\, n^{\,n+.5}\, e^{-n}\right) \\
          &= 1.33 + n \log_2 n + .5 \log_2 n - 1.43\,n \\
          &\approx n\,(\log_2 n - 1.43)
\end{aligned}$$

Moreover, for large n (say n > 1000)

$$\log_2 n! \approx n \log_2 n$$
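
The quality of the approximation is easy to check numerically. The Python sketch below is illustrative; the function names are assumptions, and math.lgamma is used to obtain the logarithm of n! without forming the factorial itself.

```python
import math


def log2_factorial(n):
    """log2(n!) computed from the log-gamma function."""
    return math.lgamma(n + 1) / math.log(2)


def stirling_log2_factorial(n):
    """Stirling's estimate: log2 of sqrt(2 pi) * n**(n + .5) * e**(-n)."""
    return (0.5 * math.log2(2 * math.pi)
            + (n + 0.5) * math.log2(n)
            - n * math.log2(math.e))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, log2_factorial(n), stirling_log2_factorial(n),
              n * (math.log2(n) - 1.43))
```

The last column printed is the cruder form n(log2 n - 1.43), which drops the constant and the .5 log2 n term.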


The information theory argument may be made rigorous by the
following line of reasoning. Suppose we wanted to communicate
to a distant location the contents of a permutation vector P.
If P has n elements and if all permutations are equally likely
then this will require $\log_2 n!$ bits (on the average). That this
is true is intuitively plausible. For a more general and
rigorous treatment of the subject consult any textbook on
information theory. For example, see Reza [1961], p.148. This
granted, assume that we have a comparison sorting algorithm
(Algorithm S) which uses a predicate COMPARE(X,Y) to obtain
information about the array it is sorting. But no other
information about the values of the elements of the array is
available to S. If we allow Algorithm S to sort P it will
transform P into I, the identity permutation vector 1,2,...,n.
Now at a distant location set up Algorithm S to sort the
elements of I using the comparison bits tapped from the sorting
of P. This setup is shown in Figure 13.1. The result of this
is that I is transformed into the inverse of P so that we have
effectively transmitted P. Since the information transmitted
must be at least $\log_2 n!$ bits on the average we know that we
must have at least $\log_2 n!$ comparisons on the average.


Figure 13.1  An information theoretic argument for showing that
sorting requires $\log_2 n!$ comparisons. (Two copies of
Algorithm S, each with its COMPARE(X,Y) predicate, are joined
by a communication link.)


It is important to understand what the formula says. It does 
not say that we must necessarily make this many comparisons in 
any given instance. We must, rather, make this many com- 
parisons on the average if the permutations are equally 
likely. From this observation we can deduce that if the number 
of comparisons which are to be made is independent of the 
distribution and only dependent on n (the number of items) 
then the method must make at least $\log_2 n!$ comparisons if it
is to work for all possible distributions.

There are four principal kinds of comparison sorts:


Interchange 
Merging 
Selection 
Insertion 


INTERCHANGE SORTING

Given an array, the elements of the array can be pair-wise
interchanged until the elements are sorted. This has the
advantage that no additional storage need be allocated.
Moreover no other sort type has this property. But every
interchange sort has some flaw which makes it unacceptable
for some applications.


Program 13.1  BSORT

The simplest kind of interchange sort which is of any interest
is the so-called bubble sort. In the bubble sort the first and
second items are compared; if they are out of order they are
interchanged. This sorts the first 2 items. To sort the first
K items assuming the first K-1 items are sorted we 'bubble'
the Kth item down through the sorted list of K-1 items
searching for its correct insertion point. This takes an
average of approximately K/2 comparisons to insert the Kth
item and approximately N(N/4) comparisons to sort N items.
This is really too many, yet the popularity of the bubble sort
persists. This is due to several factors. The bubble sort is
easy to program and understand. Also for small N the figure
N(N/4) is not much greater than $N \log_2 N$. Hence the bubble
sort is reasonably fast for N = 25 or so. But as the number
of items increases the bubble sort departs severely from the
ideal. At N = 100, the bubble sort requires 4 times as many
comparisons. For N = 1000 the ratio is 25.


Sorting routines, like the bubble sort, whose comparisons are
dominated by the factor $N^2$ are called quadratic. Sorting
algorithms which obey an $N \log_2 N$ law or differ by a
proportionality constant are called logarithmic. Though inefficient
for large N, a quadratic sort can be more efficient than a 
logarithmic sort for small values of N (less than 10 or so). 
For this reason a logarithmic sort may use a quadratic sort as 
a utility routine for the purpose of handling small arrays. 


For medium values of N the bubble sort can save time if the 
array is almost sorted to begin with. The bubble sort, more 
than most, takes advantage of any pre-existing order in the 
array. 


* BSORT(A,I,N) will sort (via a Bubble sort) in ascending
* lexical order the strings in the subarray A<I>, A<I + 1>,
* ..., A<N>.  CAUTION: Bubble sorts may be time consuming
* for large arrays.
*
         DEFINE('BSORT(A,I,N)J,K,V')                   :(BSORT_END)
*
* Entry point: J will hold the index of the item to be
* bubbled.
*
BSORT    J = I
*
* Outer loop: Loop on J.  V is the value of the bubble.
*
BSORT_1  J = J + 1 LT(J,N)                             :F(RETURN)
         K = J
         V = A<J>
*
* Inner loop: Loop on K.  We bubble down into the lower
* portion of the array looking for a place to insert V.
*
BSORT_2  K = K - 1 GT(K,I)                             :F(BSORT_RO)
         A<K + 1> = LGT(A<K>,V) A<K>                   :S(BSORT_2)
         A<K + 1> = V                                  :(BSORT_1)
*
* On runout, plunk bubble into bottom and go back to outer
* loop.
*
BSORT_RO A<I> = V                                      :(BSORT_1)
BSORT_END
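
For readers who want to trace the bubbling step by hand, the same procedure can be sketched in Python. This is an illustration, not the book's code: the name bsort and the 0-based inclusive bounds i and n are assumptions of the example.

```python
def bsort(a, i, n):
    """Sort the slice a[i..n] (inclusive bounds) in place, ascending."""
    for j in range(i + 1, n + 1):
        v = a[j]                          # the value of the bubble
        k = j - 1
        # Bubble down into the sorted lower portion looking for
        # the place to insert v, shifting larger items up by one.
        while k >= i and a[k] > v:
            a[k + 1] = a[k]
            k -= 1
        a[k + 1] = v                      # also covers runout at the bottom
    return a


if __name__ == "__main__":
    words = ["pear", "apple", "fig", "kiwi", "date"]
    print(bsort(words, 0, len(words) - 1))
```

On an array that is already nearly in order the inner loop exits almost immediately, which is the behaviour the text credits to the bubble sort.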


Program 13.2  HSORT

An interchange sort which is logarithmic rather than
quadratic is one introduced by Hoare [1961] and improved by
Hoare [1962] and Scowen [1965]. It is frequently called
QUICKSORT. The basic idea is to interchange the elements of
the array until they are partitioned into two groups, A and B,
such that

(i) Each element in group A lies lower (i.e. has lower index) 
than every element in group B. 


(ii) Every element in group A < every element in group B. 


Note that A and B need not be equal in size. If groups A and 
B are then sorted separately the entire array will be sorted. 
The sort routine therefore consists of partitioning the array 
followed by two recursive calls to sort the partitions. 


One method of partitioning is to pick the middle element and
use this as a criterion to separate the lows from the highs.
The elements of lower index are examined one by one for an
element that is >= this criterion.  The elements of higher
index are searched from the top down to determine if any are
<= this criterion.  When found, the elements are interchanged
and the search goes on.  Eventually the two pointers cross, at
which point the partitioning is completed.
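
The partitioning scheme just described can be sketched outside SNOBOL4. The following Python fragment is only an illustration under stated assumptions: the name hsort and the zero-based inclusive bounds lo and hi are hypothetical and are not taken from the book's listing. It picks the middle element as criterion, moves two pointers toward each other, and recurses on the two groups.

```python
def hsort(a, lo, hi):
    """Sort a[lo..hi] (inclusive bounds) in place by Hoare-style partitioning."""
    if hi - lo <= 1:
        # One or two items: at most a single interchange is needed.
        if lo < hi and a[lo] > a[hi]:
            a[lo], a[hi] = a[hi], a[lo]
        return
    criterion = a[(lo + hi) // 2]      # middle element
    j, k = lo - 1, hi + 1
    while True:
        j += 1
        while a[j] < criterion:        # stop at an element >= criterion
            j += 1
        k -= 1
        while a[k] > criterion:        # stop at an element <= criterion
            k -= 1
        if j >= k:                     # pointers have crossed
            break
        a[j], a[k] = a[k], a[j]
    hsort(a, lo, k)                    # group A
    hsort(a, k + 1, hi)                # group B


if __name__ == "__main__":
    names = ["PAT", "JOE", "TOM", "AL", "SUE", "BOB"]
    hsort(names, 0, len(names) - 1)
    print(names)
```

As in the text, the two groups need not be equal in size; the pointer k serves as the dividing line.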


For each partition there are approximately n comparisons where
n is the size of the array to be partitioned.  Hence the number
of comparisons is n times the average depth of the recursion.
Ideally this is log2 n.  Hence, ideally the number of
comparisons approaches n log2 n.  But this ideal is reached only
if the criterion is always chosen so that it partitions the
array in half.  For a randomly chosen criterion the figure for
the number of comparisons is approximately 1.4 n log2 n [Hoare
1962].  This factor of 1.4 also shows up in the analysis of
one of the insertion sorts (see Exercise 13.13).


HSORT is not particularly fast for arrays with a small number 
of items. Ideally, when the array is small, BSORT should be 
called. This is explored in an exercise. 


The algorithm given here differs somewhat from Hoare [1961]
and is such as to reduce the size of the program at the
expense of a small increase in running time.


*---------------------------------------------------------------
*  HSORT(A,I,N) will sort the strings in array A<I>, A<I +
*  1>, ..., A<N> in ascending sequence.  HSORT calls itself
*  recursively.
*---------------------------------------------------------------
         DEFINE('HSORT(A,I,N)J,K,CRITERION')         :(HSORT_END)
*---------------------------------------------------------------
*  Entry point: If more than 2 items remain skip.  If only 1
*  item is to be sorted, just return.
*---------------------------------------------------------------
HSORT    GT(N - I,1)                                 :S(HSORT_LARGE)
         GE(I,N)                                     :S(RETURN)
         (LGT(A<I>,A<N>)  SWAP(.A<I>,.A<N>))         :(RETURN)
*---------------------------------------------------------------
*  Obtain CRITERION to be used for partitioning array into 2
*  groups.
*---------------------------------------------------------------
HSORT_LARGE
         CRITERION  =  A<(I + N) / 2>
*---------------------------------------------------------------
*  J will move through the array from the bottom looking for
*  an element >= CRITERION.  K will move through the array from
*  the top looking for an element <= CRITERION.
*---------------------------------------------------------------
         J  =  I - 1
         K  =  N + 1
HSORT_UP J  =  J + 1
         ~LGT(CRITERION,A<J>)                        :F(HSORT_UP)
HSORT_DOWN K  =  K - 1
         ~LGT(A<K>,CRITERION)                        :F(HSORT_DOWN)
*---------------------------------------------------------------
*  If J is still < K, interchange and go back.
*---------------------------------------------------------------
         (LT(J,K)  SWAP(.A<J>,.A<K>))                :S(HSORT_UP)
*---------------------------------------------------------------
*  Otherwise, we are done partitioning the elements.  K will
*  serve as a convenient dividing line.  Sorting will be
*  accomplished by sorting the 2 subarrays.  Might as well use
*  HSORT to do this.
*---------------------------------------------------------------
         HSORT(A,I,K)
         HSORT(A,K + 1,N)                            :(RETURN)
HSORT_END

Names referenced     Name     Type       Where defined
by HSORT:            SWAP     Function   Program 3.14

Epilogue


A difficulty with the Hoare sort is the possibility that equal 
items will not retain their relative order. In the subroutine 
given, this makes no difference since such an inversion will 
be undetectable by the user. But in sorting structures, for 
example, this property could prove to be a critical defect. 


MERGING

Merging is not strictly a sorting technique.  It is a technique
whereby two sorted aggregates can be combined into one sorted
aggregate by the simple process of selecting and incrementing
the aggregate showing the current least value.  But merging
may be converted into a sorting technique in the following
way.  Let the final sorted aggregate of length n be the result
of merging two sorted aggregates of length n/2.  Let each of
these be the result of merging two aggregates of length n/4,
etc.  Ultimately we reach a point at which the aggregates have
length 1 and can be regarded as being sorted.  The merge sort
is quite efficient and approaches the theoretical lower limit
on the number of comparisons needed.
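
As a rough illustration of how merging becomes a sort, here is a short Python sketch (not SNOBOL4, and not one of the book's programs). The names merge and merge_sort are hypothetical. Two sorted lists are merged by repeatedly taking from whichever shows the current least value, and the sort halves its input until the pieces have length 1.

```python
def merge(x, y):
    """Combine two sorted lists into one sorted list."""
    out = []
    i = j = 0
    while i < len(x) and j < len(y):
        # Take from y only when it shows a strictly smaller value,
        # so equal items keep their original relative order.
        if y[j] < x[i]:
            out.append(y[j])
            j += 1
        else:
            out.append(x[i])
            i += 1
    out.extend(x[i:])   # at most one of these two tails is nonempty
    out.extend(y[j:])
    return out


def merge_sort(items):
    """Sort by merging two sorted halves; length-1 pieces are already sorted."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    return merge(merge_sort(items[:mid]), merge_sort(items[mid:]))
```

Note that this sketch allocates fresh lists at every level, which is exactly the storage cost that the chaining method of MSORT is designed to avoid.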


Program 13.3  LSORT

The aggregate merged in the merge sort can be any collection of
information accessible in serial fashion and hence it is a
favorite way of sorting such serial aggregates as files and
lists.  LSORT will sort a linked list in ascending sequence
according to the value contained in the VALUE field.  If HEAD
is the head of the linked list then LSORT(HEAD) will sort the
list and return the new head.  LSORT does not allocate new
storage; it just rearranges pointers.
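
The idea of sorting a linked list purely by rearranging pointers can be sketched in Python as follows. This is an illustration only: the class Link and the functions merge_lists and lsort are hypothetical names, and the list is split top-down at its midpoint, whereas LSORTA below peels off runs whose lengths are powers of 2. The sketch also uses one temporary dummy link per merge for convenience, which the SNOBOL4 program avoids.

```python
class Link:
    """One link of a singly linked list (hypothetical stand-in for VALUE/NEXT)."""
    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def merge_lists(l1, l2):
    """Merge two sorted lists by relinking; returns the new head."""
    dummy = tail = Link(None)          # temporary receptacle for the first link
    while l1 is not None and l2 is not None:
        if l1.value > l2.value:        # choose l2 only when strictly smaller
            tail.next, l2 = l2, l2.next
        else:
            tail.next, l1 = l1, l1.next
        tail = tail.next
    tail.next = l1 if l1 is not None else l2   # append the remainder
    return dummy.next


def lsort(head):
    """Sort a linked list in ascending order of value; returns the new head."""
    if head is None or head.next is None:
        return head
    slow, fast = head, head.next
    while fast is not None and fast.next is not None:
        slow, fast = slow.next, fast.next.next
    second, slow.next = slow.next, None        # cut the list in two
    return merge_lists(lsort(head), lsort(second))
```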


*---------------------------------------------------------------
*  LSORT will sort a linked list L using a merge sort.  The
*  caller may specify the name of the value field, the next
*  field and the predicate.  Default names are VALUE, NEXT
*  and LGT.
*---------------------------------------------------------------
         DEFINE('LSORT(L,VFLD,NFLD,PRED)L1,L2,PTR')
*---------------------------------------------------------------
*  LSORT uses the auxiliary function LSORTA which is called
*  recursively.
*---------------------------------------------------------------
         DEFINE('LSORTA(N)I')                        :(LSORT_END)
*---------------------------------------------------------------
*  Entry point for LSORT: Give default names.  Then make the
*  fields used in the program synonymous with these.
*---------------------------------------------------------------
LSORT    VFLD  =  IDENT(VFLD)  'VALUE'
         NFLD  =  IDENT(NFLD)  'NEXT'
         PRED  =  IDENT(PRED)  'LGT'
         OPSYN('VFLD',VFLD)
         OPSYN('NFLD',NFLD)
         OPSYN('PRED',PRED)
*---------------------------------------------------------------
*  Calling LSORTA with an argument of 0 will sort the entire
*  list.
*---------------------------------------------------------------
         LSORT  =  LSORTA(0)                         :(RETURN)
*---------------------------------------------------------------
*  Entry point for LSORTA:  LSORTA(N) where N is a power of 2
*  will return a sorted list comprised of the first N links
*  of the list L (or all of the list if fewer than N links
*  remain).  The variable L is treated as global and is
*  altered.  If N is 0 the entire list will be sorted and
*  returned.
*---------------------------------------------------------------
LSORTA   IDENT(L)                                    :S(FRETURN)
*---------------------------------------------------------------
*  Remove exactly one link from the head of the list.  If N =
*  1, then we return immediately.
*---------------------------------------------------------------
         LSORTA  =  L
         L  =  NFLD(L)
         NFLD(LSORTA)  =
         I  =  1
LSORT_1  EQ(N,I)                                     :S(RETURN)
*---------------------------------------------------------------
*  Otherwise our list is not sufficiently long.  Let us obtain
*  another list of length I and merge the two.  If L is null,
*  we are done.
*---------------------------------------------------------------
         L2  =  LSORTA(I)                            :F(RETURN)
         L1  =  LSORTA
*---------------------------------------------------------------
*  Merging begins here.  PTR will point to the receptacle
*  which will receive the next item.  Flow goes to LSORT_L1
*  if the next item is to come from list L1; otherwise, flow
*  falls through.
*---------------------------------------------------------------
         PTR  =  .LSORTA
LSORT_C  PRED(VFLD(L1),VFLD(L2))                     :F(LSORT_L1)
*---------------------------------------------------------------
*  Choose L2; update PTR and L2; loop unless runout in
*  which case the entire L1 list is appended.
*---------------------------------------------------------------
         $PTR  =  L2
         PTR  =  .NFLD(L2)
         L2  =  NFLD(L2)
         IDENT(L2)                                   :F(LSORT_C)
         $PTR  =  L1                                 :(LSORT_DONE)
*---------------------------------------------------------------
*  Choose L1; similar comments as above apply.
*---------------------------------------------------------------
LSORT_L1 $PTR  =  L1
         PTR  =  .NFLD(L1)
         L1  =  NFLD(L1)
         IDENT(L1)                                   :F(LSORT_C)
         $PTR  =  L2
*---------------------------------------------------------------
*  Our list (beginning at LSORTA) is now twice as long as it
*  was.  Record this in I and loop back to see if this
*  suffices.
*---------------------------------------------------------------
LSORT_DONE I  =  I * 2                               :(LSORT_1)
LSORT_END

Program 13.4  MSORT

The function MSORT is a sort based on the merging principle.
A call to MSORT requires only one argument, the array of
strings to be sorted.  It assumes the array has a lower bound
of 1 and obtains the upper bound by a call to the prototype
function.


MSORT(A) will not sort the array A but will return an array of
integers (i.e. a permutation vector) which can then be applied
to the array A and any passive array by using AI (Prog. 4.6).
Thus if A is an array of names and if B is an array of
(associated) salaries then

         I  =  MSORT(A)
         A  =  AI(A,I)
         B  =  AI(B,I)

will sort A and B according to alphabetic order of A.  MSORT
will sort numerical items if a second argument denoting the
comparison predicate is given.  Thus

         I  =  MSORT(B,'GT')
         B  =  AI(B,I)
         A  =  AI(A,I)

will sort the two lists by salary (in increasing order).  More
exactly, an element X in the array B which appears before an
element Y will be placed after this element if and only if the
predicate GT(X,Y) holds.


The coding of MSORT is based on the sorting algorithm designed
for APL as described by Woodrum [1969].  He defines the notion
of a chain of subscripts as follows.  Let P be an array of
integers.  Then, for any integer K we have the sequence of
integers (called a chain)

         K, P<K>, P<P<K>>, ...

We will assume the sequence terminates by the appearance of a
0 subscript which will cause failure in the reference.  In the
cited paper, the sequence terminates by two consecutive equal
subscripts.  Such a sequence of integers can represent a list
of elements of the array A as

         A<K>, A<P<K>>, A<P<P<K>>>, ...

Whereas it seems to be always necessary to allocate fresh
storage in order to do a merge sort, the method of chaining
permits us to merge without allocating any more storage than
needed to contain the permutation vector.  The behavior of
MSORT is such as to form increasingly longer chains
representing sorted lists of elements of A.
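
The chaining idea can be illustrated with a small Python sketch. It is not a transcription of MSORT: the name msort_perm is hypothetical, indices are kept 1-based to match the text, and the sketch walks the final chain directly to produce the permutation vector instead of building an inverse and inverting it. The only storage beyond the input is the vector p, in which p[k] holds the successor of k in its chain and 0 ends a chain.

```python
def msort_perm(a):
    """Return 1-based indices of a in ascending order, using index chains."""
    n = len(a)
    if n == 0:
        return []
    p = [0] * (n + 1)                  # p[k] = next index in chain; 0 = end

    def val(i):
        return a[i - 1]                # 1-based access to a

    def chain(lo, hi):
        """Chain the indices lo..hi into sorted order; return the chain head."""
        if lo == hi:
            return lo
        mid = (lo + hi) // 2
        i = chain(lo, mid)
        j = chain(mid + 1, hi)
        # Pick the head: j comes first only if a<i> is strictly greater.
        if val(i) > val(j):
            head = k = j
            j = p[j]
        else:
            head = k = i
            i = p[i]
        # k is the last element of the chain being built.
        while i and j:
            if val(i) > val(j):
                p[k] = j
                k = j
                j = p[j]
            else:
                p[k] = i
                k = i
                i = p[i]
        p[k] = i or j                  # concatenate whichever chain remains
        return head

    order = []
    k = chain(1, n)
    while k:
        order.append(k)
        k = p[k]
    return order
```

Applying the returned vector to the original array (and to any associated passive array) yields the sorted arrangement, in the same spirit as the calls to AI shown earlier.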


*---------------------------------------------------------------
*  MSORT(A,OP) uses a merge sort to return an array of
*  indices which can then be used to sort the array A.  OP is
*  the operation to be used to indicate ordering.
*---------------------------------------------------------------
         DEFINE('MSORT(A,OP)U,P,I,K,SAVE,AI,AJ')
*---------------------------------------------------------------
*  CHAIN is an auxiliary function called by MSORT to chain
*  the indices in the global array P<L>, ..., P<U>.  It
*  returns the top of the chain.  It calls itself recursively.
*---------------------------------------------------------------
         DEFINE('CHAIN(L,U)I,J,MIDDLE,K')            :(MSORT_END)
*---------------------------------------------------------------
*  CHAIN entry point: If the number of items to be sorted is
*  1, just return the index.
*---------------------------------------------------------------
CHAIN    CHAIN  =  EQ(L,U)  L                        :S(RETURN)
*---------------------------------------------------------------
*  Otherwise split the array into 2 parts, and chain each
*  part separately.
*---------------------------------------------------------------
         MIDDLE  =  (L + U) / 2
         I  =  CHAIN(L,MIDDLE)
         J  =  CHAIN(MIDDLE + 1,U)
*---------------------------------------------------------------
*  Now merge the 2 chains.  The value to be returned will be
*  either I or J depending upon which should come first.  This
*  is determined by the function CHAINOP which must be
*  defined by the caller.
*---------------------------------------------------------------
         CHAIN  =  I
         AI  =  A<I>
         AJ  =  A<J>
         CHAIN  =  CHAINOP(A<I>,A<J>)  J
*---------------------------------------------------------------
*  K will point to the last element in the chain being built.
*  Then branch to increment one or the other of the 2
*  indices.
*---------------------------------------------------------------
         K  =  CHAIN
         EQ(K,I)                                     :S(CHAIN_I1)F(CHAIN_J1)
*---------------------------------------------------------------
*  Come here to make all subsequent comparisons.
*---------------------------------------------------------------
CHAIN_COMP CHAINOP(AI,AJ)                            :S(CHAIN_J)F(CHAIN_I)
*---------------------------------------------------------------
*  The I-chain has won; place I on the chain and update the
*  last-element pointer.
*---------------------------------------------------------------
CHAIN_I  P<K>  =  I
         K  =  I
*---------------------------------------------------------------
*  Obtain next element from I chain and go back for a
*  comparison; if no more elements are left, fall through,
*  concatenate the remainder of the J chain and return.
*---------------------------------------------------------------
CHAIN_I1 I  =  P<I>
         AI  =  A<I>                                 :S(CHAIN_COMP)
         P<K>  =  J                                  :(RETURN)
*---------------------------------------------------------------
*  The following code is analogous to the code above; J and I
*  have been interchanged.
*---------------------------------------------------------------
CHAIN_J  P<K>  =  J
         K  =  J
CHAIN_J1 J  =  P<J>
         AJ  =  A<J>                                 :S(CHAIN_COMP)
         P<K>  =  I                                  :(RETURN)
*---------------------------------------------------------------
*  Entry point for MSORT: Obtain comparison expression.  Then
*  allocate a permutation vector (P) and form a chain.
*---------------------------------------------------------------
MSORT    OP  =  IDENT(OP)  'LGT'
         OPSYN('CHAINOP',OP)
         U  =  +PROTOTYPE(A)
         P  =  ARRAY(U)
         I  =  CHAIN(1,U)
*---------------------------------------------------------------
*  Convert chain by replacing in P<I> the value K where
*  A<P<I>> is the Kth element of the sort.
*---------------------------------------------------------------
MSORT_1  K  =  K + 1
         SAVE  =  P<I>                               :F(MSORT_2)
         P<I>  =  K
         I  =  SAVE                                  :(MSORT_1)
*---------------------------------------------------------------
*  We now have the inverse of a permutation vector.  Invert
*  it and return it.
*---------------------------------------------------------------
MSORT_2  IP(P)
         MSORT  =  P                                 :(RETURN)
MSORT_END

Names referenced     Name     Type       Where defined
by MSORT:            IP       Function   Program 12.6

Epilogue

Merge sorting is quite fast.  It not merely betters the figure
of n log2 n comparisons (but of course not less than log2 n!)
but will take advantage of any pre-ordering that exists in the
data.  Its popularity for sorting arrays has been inhibited by
the necessity of allocating additional storage.


Program 13.5  FRSORT

A frequency sort on a string will return a string where the
characters have been sorted on the basis of the number of
occurrences in the string.  Each character will appear at most
once in the returned string.  For example,

         FRSORT('MISSISSIPPI') will return 'ISPM'.

This is an example of a sorting application which makes use of
a passive array of information (the characters) while sorting
on an array of numbers.  It also serves to demonstrate the use
of MSORT.
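
For comparison, the same idea can be sketched in a few lines of Python. The function name frsort is hypothetical, and the sketch assumes that the distinct characters are taken in order of first occurrence and that ties in frequency keep that order; the text does not say how SKIM orders its result, so this is an assumption made only for illustration.

```python
def frsort(s):
    """Return the distinct characters of s, most frequent first."""
    chars = list(dict.fromkeys(s))               # distinct, first-occurrence order
    counts = [s.count(c) for c in chars]         # passive array's sort keys
    # Sort the indices by descending count; sorted() is stable, so
    # characters with equal counts keep their first-occurrence order.
    order = sorted(range(len(chars)), key=lambda i: -counts[i])
    return "".join(chars[i] for i in order)


if __name__ == "__main__":
    print(frsort("MISSISSIPPI"))   # ISPM
```

Here the list of indices plays the role of the permutation vector: the counts are sorted, and the characters simply follow along.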


*---------------------------------------------------------------
*  FRSORT(S) will do a frequency sort on the characters of
*  the string S.  The most frequent character will appear
*  first in the string returned.
*---------------------------------------------------------------
         DEFINE('FRSORT(S)SC,C,N,I')                 :(FRSORT_END)
*---------------------------------------------------------------
*  Entry point: Obtain in the array C the set of characters
*  of which S is composed.  Then allocate an array N to hold
*  the number of occurrences in S of the corresponding
*  characters of C.
*---------------------------------------------------------------
FRSORT   C  =  CRACK(SKIM(S))
         N  =  ARRAY(PROTOTYPE(C))
         SEQ(' N<I> = COUNT(S,C<I>) ',.I)
*---------------------------------------------------------------
*  Sort the indices of N and apply these indices to the array
*  C.  Then convert the array to a string.
*---------------------------------------------------------------
         FRSORT  =  STRINGOUT(AI(C,MSORT(N,'LT')))   :(RETURN)
FRSORT_END

Names referenced     Name        Type       Where defined
by FRSORT:           SKIM        Function   Program 3.11
                     COUNT       Function   Program 3.4
                     AI          Function   Program 4.6
                     MSORT       Function   Program 13.4
                     STRINGOUT   Function   Program 4.2
                     CRACK       Function   Program 4.1
                     SEQ         Function   Program 4.3


SELECTION SORTING

In selection sorting the least element of the input aggregate
is selected and is placed into the output aggregate.  This
element can be chosen in the straightforward way of making one
pass through the array to determine the least element.  When an
element is chosen, its position can be filled with a special
marker to avoid selecting that element in the future.  To
select the least element in this way requires n-1 comparisons
and hence this form of selection sort requires a total of
n(n-1) comparisons.  This is unfortunately far more than the
theoretical minimum of n log2 n.

But selection sorting can be continually refined until this
lower limit is approached.  For example, the n items can be
subdivided into SQRT(n) groups of SQRT(n) items each.  Assume
that for each group a least item is known.  Then a selection
consists of first selecting the least of these least items.
Then only the selected candidate's group must be searched for
a least item to recompose the original situation.  This kind
of selection will be called order-2 selection and requires

         2(n^(1/2) - 1)

comparisons for each item obtained.  We may decompose our array
into a group of groups of groups and so have order-3
selection.  Assuming each group has the same number of members
(the cube root of n) then a selection would require

         3(n^(1/3) - 1)

comparisons.  For a level k hierarchy we would need

         k(n^(1/k) - 1)

comparisons per item.  This value monotonically decreases as k
increases and so it pays to make k as large as possible.  In
the limit the hierarchy becomes a binary tree.  The 'winner'
of each subgroup 'plays' the 'winner' of the adjacent subgroup
to determine the winner of the group, etc.  This method of
sorting has the suggestive name tournament sort.  The number
of levels k becomes log2 n and plugging this value in for k we
obtain

         log2 n (2 - 1)  =  log2 n

comparisons per extraction which is close to the theoretical
limit.
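
The claim that the cost per selection falls steadily as the hierarchy deepens is easy to check numerically. The following Python fragment is a small illustration; the function name comparisons_per_selection is hypothetical, and the formula is simply the k(n^(1/k) - 1) expression from the text.

```python
def comparisons_per_selection(n, k):
    """Comparisons to extract one item from an order-k hierarchy of n items."""
    return k * (n ** (1.0 / k) - 1)


if __name__ == "__main__":
    n = 1024
    for k in (1, 2, 3, 5, 10):
        print(k, round(comparisons_per_selection(n, k), 2))
```

For n = 1024, order 1 needs 1023 comparisons per item, order 2 needs 62, and order 10 (where each group has two members, so k = log2 n) needs about 10.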


Program 13.6  TSORT

TSORT stands for Tournament sort; it also stands for Table sort
since it can be used to sort tables as well as one- and
two-dimensional arrays.  The method by which tournament winners
are recorded is by an auxiliary array of subscripts.  Consider
a typical tournament where the winner is decided by lexical
ordering (first in alphabetical order wins).  The playoff of
such a tournament is shown in Figure 13.2.

     Array A

     JIM       1 --+
                   +-- 2 --+
     BETH      2 --+       |
                           +-- 2 --+
     CHUCK     3 --+       |       |
                   +-- 3 --+       |
     MAUREEN   4 --+               |
                                   +-- 2
     KATE      5 --+               |
                   +-- 6 --+       |
     BILL      6 --+       |       |
                           +-- 6 --+
     REE       7 --+       |
                   +-- 8 --+
     JOHN      8 --+

                  Figure 13.2


Here, subscripts, rather than actual values, are used to
denote players in the tournament.  Assume that the number of
players N in the tournament is a power of 2.  Then the
tournament can be recorded in an array T of length 2 * N - 1.
For example the above tournament is represented as:

            1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

Array T  |  2  2  6  2  3  6  8  1  2  3  4  5  6  7  8 |

            Playoff results      Base of tournament

Here the elements T<8> through T<15> (in general, T<N> through
T<2 * N - 1>) hold the base of the tournament.  The rest of
array T is filled in as follows.  To determine which subscript
(of array A) should be placed into T<I>, a playoff is arranged
between T<I * 2> and T<I * 2 + 1>.  This method of recording
the tournament is adopted from a tree-sorting algorithm by
Floyd [1964], and can generally be used to encode a balanced
binary tree.  T<I> has sons T<I * 2> and T<I * 2 + 1> and has
father T<I / 2>.

The value found in T<1> is the subscript in A of the overall
tournament winner.  To find the runner-up, the winner is
'disqualified' by assigning a zero subscript into his original
slot.  This is found by adding N - 1 to the subscript in A.
Thus if A<2> is the winner, T<2 + N - 1> is set to 0 to
produce:

            1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

Array T  |  2  2  6  2  3  6  8  1  0  3  4  5  6  7  8 |

A series of events is then run to resolve the outcome of games
in which only he was involved.  This is done as follows.  The
element T<9> was used in the battle to determine T<9 / 2> =
T<4>.  Hence we recompare T<2 * 4> and T<2 * 4 + 1>.  The
resulting element T<4> is used to compute the new entry in
T<4 / 2> = T<2>.  This proceeds for log2 N steps until T<1> is
determined.  In our example, this produces:

            1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

Array T  |  6  3  6  1  3  6  8  1  0  3  4  5  6  7  8 |

The new winner, indicated by T<1>, is 6 which refers to 'BILL'
in the original array A.  This process is repeated until the
winning index is a zero.
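
The bookkeeping described above can be sketched compactly in Python. This is an illustration only and not a rendering of TSORT: the names tournament_sort and playoff are hypothetical, only a one-dimensional list of strings with the default ordering is handled, and the array t is padded to a full power of 2 with zero (open) slots rather than being allocated short.

```python
def tournament_sort(a):
    """Return a new sorted list using a tournament recorded as subscripts."""
    n = len(a)
    if n == 0:
        return []
    ts = 1
    while ts < n:                      # smallest power of 2 >= n
        ts *= 2
    t = [0] * (2 * ts)                 # t[1 .. 2*ts-1]; 0 marks an open slot
    for i in range(1, n + 1):
        t[ts - 1 + i] = i              # base of the tournament

    def playoff(k):
        """Record in t[k] the winner of t[2k] versus t[2k+1]."""
        i, j = t[2 * k], t[2 * k + 1]
        if i == 0:
            t[k] = j
        elif j == 0:
            t[k] = i
        elif a[i - 1] > a[j - 1]:      # the lexically smaller player wins
            t[k] = j
        else:
            t[k] = i

    for k in range(ts - 1, 0, -1):     # run the complete tournament
        playoff(k)

    out = []
    while t[1] != 0:
        w = t[1]                       # subscript of the current winner
        out.append(a[w - 1])
        k = ts - 1 + w
        t[k] = 0                       # disqualify the winner
        k //= 2
        while k >= 1:                  # replay only the matches he played
            playoff(k)
            k //= 2
    return out
```

Each extraction after the first replays one match per level, which is the log2 N figure derived earlier.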


*---------------------------------------------------------------
*  TSORT(A,F,P) will use a tournament sort to sort the
*  elements of the array or table A according to predicate P.
*  P may be absent in which case the assumed predicate is LGT.
*  A may be singly-dimensioned in which case F, if nonnull,
*  will indicate the field of a programmer-defined datatype
*  on which the sort is based.  A may also be a table or a
*  doubly dimensioned array.  In these cases, F may be an
*  integer indicating the column on which to sort.  If F is
*  null, it is taken to be 1.  The array A is not modified; a
*  new array is allocated and returned.
*---------------------------------------------------------------
         DEFINE('TSORT(A,F,P)I,J,X,N,TS,T,P_I_J,K,II,W')
*---------------------------------------------------------------
*  PLAYOFF(K) is a utility routine used by TSORT to determine
*  the winner of T<K * 2> and T<K * 2 + 1> and to modify T<K>
*  accordingly.  It will fail if K is < 1.  The array T
*  contains subscripts; some of these are 0 indicating open
*  slots.
*---------------------------------------------------------------
         DEFINE('PLAYOFF(K)')                        :(PLAYOFF_END)
PLAYOFF  LT(K,1)                                     :S(FRETURN)
         I  =  T<K * 2>                              :F(PLF_J)
         J  =  T<K * 2 + 1>                          :F(PLF_I)
         LE(I,0)                                     :S(PLF_J)
         LE(J,0)                                     :S(PLF_I)
         EVAL(P_I_J)                                 :S(PLF_J)
PLF_I    T<K>  =  I                                  :(RETURN)
PLF_J    T<K>  =  J                                  :(RETURN)
PLAYOFF_END
*---------------------------------------------------------------
*  TS will compute a tournament size needed for N elements;
*  i.e. the smallest power of 2 >= N.
*---------------------------------------------------------------
         DEFINE('TS(N)')                             :(TS_END)
TS       TS  =  1
TS_1     TS  =  LT(TS,N)  TS * 2                     :S(TS_1)F(RETURN)
TS_END
                                                     :(TSORT_END)
*---------------------------------------------------------------
*  TSORT entry point: Compute the size of the tournament
*  (TS).  Allocate the tournament array (T) and the array to
*  be returned.
*---------------------------------------------------------------
TSORT    A  =  CONVERT(A,'ARRAY')
         TSORT  =  ARRAY(PROTOTYPE(A))
         N  =  PROTOTYPE(A)
         N  BREAK(',') . N                           :F(TSORT_1)
         F  =  IDENT(F)  1
TSORT_1  TS  =  TS(N)
         T  =  ARRAY(TS - 1 + N)
*---------------------------------------------------------------
*  Initialize base of the tournament.
*---------------------------------------------------------------
TSORT_2  I  =  I + 1
         T<TS - 1 + I>  =  I                         :S(TSORT_2)
*---------------------------------------------------------------
*  Obtain comparison expression.
*---------------------------------------------------------------
         P  =  IDENT(P)  'LGT'
         X  =  F '(A<I>),' F '(A<J>)'
         X  =  IDENT(DATATYPE(F),'INTEGER')
+              'A<I,' F '>,A<J,' F '>'
         P_I_J  =  CONVERT(P '(' X ')','EXPRESSION')
*---------------------------------------------------------------
*  Now run a complete tournament determining an absolute
*  winner (in T<1>).
*---------------------------------------------------------------
         K  =  TS
TSORT_3  K  =  K - 1
         PLAYOFF(K)                                  :S(TSORT_3)
*---------------------------------------------------------------
*  Transfer the winning structure to TSORT.  For a one-
*  dimensional array, this is simple.  For a two-dimensional
*  array, we must go through a loop.
*---------------------------------------------------------------
TSORT_4  II  =  II + 1
         W  =  T<1>
         EQ(W,0)                                     :S(RETURN)
         TSORT<II  DIFFER(DATATYPE(F),'INTEGER')>  =  A<W>
+                                                    :S(TSORT_7)
         J  =  0
TSORT_6  J  =  J + 1
         TSORT<II,J>  =  A<W,J>                      :S(TSORT_6)
*---------------------------------------------------------------
*  'Disqualify' the winner.  Replay all matches in which he
*  was involved.
*---------------------------------------------------------------
TSORT_7  K  =  TS - 1 + W
         T<K>  =  0
TSORT_5  K  =  K / 2
         PLAYOFF(K)                                  :S(TSORT_5)F(TSORT_4)
TSORT_END
Epilogue

The tournament sort as given uses a near minimum number of
comparisons but unfortunately allocates two additional arrays.
For sorting structures, strings or two-dimensional arrays, the
additional allocation is probably not harmful since it will be
small compared to the storage already allocated.  Minimum-core
array sorts such as HSORT (Prog. 13.2) and Treesort 3
[Floyd 1964] have the unfortunate property of inverting equal
elements and this, we will see, can be bad for sorting arrays
of structures.  Other minimum-storage sorting algorithms such
as BSORT (Prog. 13.1) and one by Shell [1959] have the
property of not being minimum time.  There appears to be, at
this writing, no minimum-core sorting algorithm (i.e. an
in-place sort) which is minimum time and inversion free.


INSERTION SORTING

In an insertion sort the next available element to be sorted is
placed in the correct relative position in the output
aggregate.  This requires that the number of elements in the
output aggregate be adjustable and suggests the use of a list,
a string or a tree.  A simple-minded insertion sort will
compare the next item on the input list with each item in
sequence on the output list until the correct place is found,
at which point an insertion is made.  This would require, on
the average, n/4 comparisons for each inserted item.  This is
too many for large n.  But for small n, where time is not an
issue, this simple scheme has the advantage of providing a very
simple sort.


Program 13.7  SSORT

SSORT(SS,S) is a string sort (or short sort or simple sort).
The string S is inserted into a string of strings (separated
by commas) in SS.  The augmented list is returned as value.
For example, if the items in the input stream are being read
in and are to be sorted one may execute

LOOP     LIST  =  SSORT(LIST,TRIM(INPUT))            :S(LOOP)

If the input contained the names 'PAT', 'JOE', 'TOM' then the
resulting LIST would contain ',JOE,PAT,TOM,'.  Note that
leading and trailing commas form part of the resulting string.
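
The behavior of SSORT on its comma-delimited string can be mimicked in Python as a quick check of the example. The function name ssort and its splitting strategy are hypothetical; the SNOBOL4 program does the same job with a single pattern match instead of splitting the string.

```python
def ssort(ss, s):
    """Insert s into the comma-delimited sorted string ss; return the result."""
    if not ss:
        return "," + s + ","           # first item: leading and trailing commas
    items = ss[1:-1].split(",")        # strip the outer commas, then split
    pos = len(items)                   # default: append at the end
    for index, item in enumerate(items):
        if item > s:                   # first item lexically greater than s
            pos = index
            break
    items.insert(pos, s)
    return "," + ",".join(items) + ","


if __name__ == "__main__":
    result = ""
    for name in ("PAT", "JOE", "TOM"):
        result = ssort(result, name)
    print(result)   # ,JOE,PAT,TOM,
```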


         DEFINE('SSORT(SSORT,S)T')
         SS_PAT  =  ','  (BREAK(',') $ T  *LGT(T,S)  |  RPOS(0)) . T
+                                                    :(SSORT_END)
SSORT    SSORT  SS_PAT  =  ','  S  ','  T            :S(RETURN)
         SSORT  =  ','  S  ','                       :(RETURN)
SSORT_END

Epilogue

SSORT was written to be as short and as convenient as
possible.  Its major failing is that it is slow.  Not only is
it a quadratic sort, but the data structure holding the sorted
items is not the most conducive to high speed insertion.  On
the other hand, many if not most sort applications require
only something 'quick and dirty' and for such applications
SSORT is recommended since it is not only easy to type but it
saves on program space.


Program 13.8  INSERT

The insertion sort, like the other sorts, can be refined to
the point where it becomes a logarithmic sort.  To find the
correct position of the ith element we ought to compare it
with the middle item.  If it is greater than this middle item
it is compared with the middle item in the upper half, and so
forth.  Thus, to insert the ith item requires approximately
log2 i comparisons.  The total number becomes (approximately)

         log2 1 + log2 2 + ... + log2 n  =  log2 n!

which is the theoretical lower limit.


This sounds attractive, but how does one find the middle
element in each of these lists?  The middle element of an
array (or subsection of an array) can be easily computed but
an array is not adjustable and its use would prove awkward in
an insertion sort.  That is, although the sort would prove
logarithmic with respect to compares it would be quadratic
with respect to moves.  A list, on the other hand, is
adjustable and an element can easily be inserted within it,
but the central element is not easily found.  The solution is
to use a tree as the receiving data aggregate.


For example, assume that the following strings are to be 
inserted. 


NOW IS THE TIME FOR ALL GOOD MEN 


If these strings are inserted into a binary tree, the result
is depicted in Figure 13.3.

                         NOW
                       /     \
                     IS       THE
                   /    \        \
                FOR      MEN      TIME
               /   \
            ALL     GOOD

                     Figure 13.3


The first string is associated with the root node.  The second
string is lexicographically less than the first and so is
associated with the left branch of the binary tree.  Each
additional string is compared with the node and successive
descendants until an opening in the tree is found, at which
point the string is inserted.  A trace through the tree will
readily indicate the nature of this process.
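
The insertion process just traced can be sketched in Python. The class BTNode and the functions insert and inorder are hypothetical names chosen to echo the fields of the book's BTNODE datatype (VALUE, NO, LSON, RSON); the sketch is an illustration, not a transcription of the SNOBOL4 program. Duplicate strings bump a count in the existing node instead of creating a new one.

```python
class BTNode:
    """A binary tree node: value, occurrence count, left son, right son."""
    def __init__(self, value):
        self.value = value
        self.no = 1
        self.lson = None
        self.rson = None


def insert(t, s):
    """Insert string s into tree t and return the (possibly new) root."""
    if t is None:
        return BTNode(s)               # an opening: create a fresh node
    if s == t.value:
        t.no += 1                      # duplicate: just count it
    elif s > t.value:
        t.rson = insert(t.rson, s)     # greater strings go right
    else:
        t.lson = insert(t.lson, s)     # lesser strings go left
    return t


def inorder(t):
    """Return (value, count) pairs in ascending order of value."""
    if t is None:
        return []
    return inorder(t.lson) + [(t.value, t.no)] + inorder(t.rson)
```

Inserting NOW IS THE TIME FOR ALL GOOD MEN in that order produces the shape of Figure 13.3, and an in-order walk of the tree yields the strings in sorted order.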


*---------------------------------------------------------------
*  INSERT(T,S) will insert the string S into the tree T and
*  return the modified tree.  If T is null a root node is
*  created and returned.
*---------------------------------------------------------------
         DEFINE('INSERT(T,S)V')
*---------------------------------------------------------------
*  BTNODE is the datatype of a single node of a binary tree.
*---------------------------------------------------------------
         DATA('BTNODE(VALUE,NO,LSON,RSON)')          :(INSERT_END)
*---------------------------------------------------------------
*  Entry point: If T is null, return immediately with a fresh
*  node.  Else we prepare to return T and go on to modify it.
*  Get VALUE(T) out for fast and easy reference.  If S equals
*  value, increment count by 1 and return.
*---------------------------------------------------------------
INSERT   INSERT  =  IDENT(T)  BTNODE(S,1)            :S(RETURN)
         INSERT  =  T
         V  =  VALUE(T)
         NO(T)  =  IDENT(S,V)  NO(T) + 1             :S(RETURN)
*---------------------------------------------------------------
*  If S > value, insert S into right half of tree; otherwise
*  into left half.
*---------------------------------------------------------------
         RSON(T)  =  LGT(S,V)  INSERT(RSON(T),S)     :S(RETURN)
         LSON(T)  =  INSERT(LSON(T),S)               :(RETURN)
INSERT_END

Epilogue

Note that we do not create separate nodes for duplicate items
but record a count in a field of the node.  This saves on
storage if the percentage of duplicate items is 20% or so.  It
also saves on compute time, especially if there are many
duplicate items.  For this reason, the binary insertion sort
is ideal for preparing a word concordance, which is a
word-frequency analysis of a piece of text.


Program 13.9  LINEARIZE

LINEARIZE(T) will linearize a binary tree of the kind used in
INSERT (Program 13.8).  The tree will be strung via its right
sons.  The value returned will be the first node of the tree.
If T is null, LINEARIZE will fail.

         DEFINE('LINEARIZE(T)')                      :(LINEARIZE_END)
*---------------------------------------------------------------
*  Entry point:
*---------------------------------------------------------------
LINEARIZE IDENT(T)                                   :S(FRETURN)
*---------------------------------------------------------------
*  Linearize the left side and attach on node T (LAST_NAME is
*  a global variable set to equal the name of the last link
*  on the chain).
*---------------------------------------------------------------
         LINEARIZE  =  IDENT(LSON(T))  T             :S(LIN_1)
         LINEARIZE  =  LINEARIZE(LSON(T))
         $LAST_NAME  =  T
*---------------------------------------------------------------
*  Now linearize the right-hand side.
*---------------------------------------------------------------
LIN_1    RSON(T)  =  LINEARIZE(RSON(T))              :S(RETURN)
         LAST_NAME  =  .RSON(T)                      :(RETURN)
LINEARIZE_END


Program 13.10  INSERTB

With some sorting procedures, an almost-sorted input will serve
to decrease sorting time.  The speedup is most pronounced with
the bubble sort but pre-ordering will favorably affect the
merge and Hoare sort as well.  With the tree insertion sort we
have the reverse phenomenon.  If the elements inserted are
already in alphabetic order the number of comparisons to
insert the Ith element is I-1, the worst case.  The
logarithmic sort becomes a quadratic sort.  Perversely, if the
elements are initially in reverse alphabetic order, we also
achieve the worst case of I-1 comparisons for the Ith element.

But the insertion sort can be modified slightly to not only
avoid the inefficiencies of almost-ordered data but to
actually take advantage of any ordering that exists.  The trick
is to grow the tree backward; that is, the last node to be
inserted should become the root of the tree.


For example, if the sequence of strings is

      NOW IS THE TIME FOR ALL GOOD

the tree grown backward becomes as shown in Figure 13.4. A
rough rule for growing the tree backward is the following.
Draw an imaginary line down the middle of the tree separating
all nodes less than the new root from all nodes greater than
it. Any path broken by such a line should be 'short circuited'
so that all pointers from any node are directed to nodes in the
same half of the tree. As an example, the result of adding the
string 'MEN' to the diagram in Figure 13.4 is shown in
Figure 13.5.


[Figure 13.4. The tree grown backward from the strings
NOW IS THE TIME FOR ALL GOOD.]

[Figure 13.5. The tree of Figure 13.4 after adding 'MEN'.]


*
* INSERTB(T,S) will insert the string S into the backward-
* growing binary tree T. The root of the returned tree will
* contain S.
*
         DEFINE('INSERTB(T,S)V')
         DATA('BTNODE(VALUE,NO,LSON,RSON)')        :(INSERTB_END)
*
* Entry point: The first part is similar to INSERT. Comments
* there are appropriate here.
*
INSERTB  INSERTB = IDENT(T) BTNODE(S,1)            :S(RETURN)
         V = VALUE(T)
         NO(T) = IDENT(S,V) NO(T) + 1              :S(RETURN)
*
* If S > value, insert S into the right half of the tree.
* The root node of the returned tree will have a VALUE of S
* and will become the root node of the tree we will be
* returning.
*
         LGT(S,V)                                  :F(INSERTB_L)
         INSERTB = INSERTB(RSON(T),S)
*
* Include the rest of T under the left side of this new
* root.
*
         RSON(T) = LSON(INSERTB)
         LSON(INSERTB) = T                         :(RETURN)
*
* Do an analogous thing for the opposite side.
*
INSERTB_L INSERTB = INSERTB(LSON(T),S)
         LSON(T) = RSON(INSERTB)
         RSON(INSERTB) = T                         :(RETURN)
INSERTB_END
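
The backward-growing insertion can be tried out in Python as well. What follows is a minimal sketch under stated assumptions rather than a transcription: BTNode, insertb and inorder are hypothetical names, and the handling of a repeated string (returning the node that already holds it, so that it still rises to the root) is an assumption of the sketch.

```python
class BTNode:
    """Hypothetical stand-in for BTNODE(VALUE,NO,LSON,RSON)."""

    def __init__(self, value):
        self.value = value
        self.no = 1            # number of occurrences of this value
        self.lson = None
        self.rson = None


def insertb(t, s):
    """Insert s into the backward-growing tree t and return the new root."""
    if t is None:
        return BTNode(s)
    if s == t.value:
        t.no += 1
        return t               # assumption: the existing node becomes the root
    if s > t.value:
        root = insertb(t.rson, s)
        t.rson = root.lson     # values between t and s stay on t's right
        root.lson = t          # t and everything smaller hang left of s
    else:
        root = insertb(t.lson, s)
        t.lson = root.rson
        root.rson = t
    return root


def inorder(t):
    """Return the stored values in ascending order, repeats included."""
    if t is None:
        return []
    return inorder(t.lson) + [t.value] * t.no + inorder(t.rson)
```

Inserting the strings of the example one at a time leaves GOOD, the last string, at the root, and an in-order walk still yields the sorted sequence.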


DISTRIBUTIVE SORTS

So far, every sort we've presented was a comparative sort.
There are other kinds, however, and these we can all lump
together in a category called distributive. In a distributive
sort, each item to be sorted is placed in a position with
respect to the other items according to some parameter of that
item. This has the attractive feature of not being binary and
thereby one can better the n log₂ n limitation. For example, if
one is sorting real numbers, uniformly distributed between 0
and 1, an excellent technique is to begin distributing the
items one at a time into the receiving array in approximately
their final position depending only on their value. Unless one
is lucky, collisions will begin to occur as the receiving array
is filling up, but the time to patch up such discrepancies is
assumed to be small compared with the time saved by the
almost-one-pass nature of the sort. The effectiveness of such a
sort is highly data dependent, however, and for this reason it
is not very popular.


A more familiar distributive sort is the radix sort. This is
the sort used on mechanical sorters which distribute cards into
bins. Assuming n cards are to be sorted on a field containing
k characters, a distribution over the least significant
character is made first. The clumps are gathered together and
passed through the machine again, this time on the next least
significant character. After k passes, the entire deck is
sorted. The number of operations is nk rather than n log₂ n
because each operation involves pitching a card into one of
several bins and such an operation yields more information than
a binary choice.


We do not have space to describe a SNOBOL4 rendition of the
radix sort but happily refer the reader to the original SNOBOL
article [Farber, et al 1964] where it appeared as an example.
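
For the curious, the card-sorter procedure described above can be sketched in a few lines of Python. This is not the SNOBOL rendition referred to; the function name radix_sort, its alphabet parameter and the assumption that every card carries a field of exactly k characters are all choices made for the illustration.

```python
def radix_sort(cards, k, alphabet=" ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    """Sort strings on a field of exactly k characters, k passes in all."""
    for pos in range(k - 1, -1, -1):           # least significant first
        bins = {ch: [] for ch in alphabet}
        for card in cards:                     # n operations per pass
            bins[card[pos]].append(card)       # pitch the card into a bin
        # Gather the clumps in bin order. The order of cards within a
        # bin is preserved, which is what makes the later passes work.
        cards = [card for ch in sorted(bins) for card in bins[ch]]
    return cards
```

Each pass touches every card once, so the total work is nk bin operations, as stated in the text.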


                          EXERCISES


Exercise 13.1  What two instructions constitute the inner loop
of BSORT? Can the reader recommend a slightly faster version?


Exercise 13.2  Prove that in HSORT the value of K when the
recursive call HSORT(A,I,K) is made is always less than N
thereby removing the possibility of an infinite loop.


Exercise 13.3  Write a non-recursive version of HSORT using
PUSH and POP (Programs 5.5 and 5.6). Hint: This can be done by
modifying 2 go-to fields and adding 5 very simple instructions
in place of the 2 recursive calls.


Exercise 13.4  Given 3 items to sort, what is the average
number of comparisons required by BSORT and by HSORT? Note, as
a consequence, that BSORT will actually be faster than HSORT
for small arrays. Estimate the crossover point at which the
number of comparisons is the same. Then modify HSORT so that it
calls BSORT for arrays smaller than this. (The estimate may be
made on analytical or empirical grounds.)


Exercise 13.5  The elements of an array A are to be sorted
numerically in ascending sequence but all numbers within a
certain range R of each other are to be regarded as numerically
equal and are to retain their relative ordering. Using MSORT,
define an appropriate predicate and sort A accordingly.


Exercise 13.6  Assume we wish to sort an array of strings, A,
alphabetically as defined by the predicate AGT (Prog. 3.13).
We could call MSORT(A, 'AGT'). What is a more efficient
procedure?


Exercise 13.7  Both MSORT(A, 'LT') and MSORT(A, 'LE') can be
used to sort A in decreasing numerical order. The difference
between the two is in the way equal elements are treated. Which
should be used so that the relative order of equal items is
retained?


Exercise 13.8  SSORT can be speeded up considerably by the
following technique. Represent a binary tree as a string by the
following method. The null string is the null tree. A tree with
root R is represented as:


      (LSON) R (RSON)


where LSON is the string representation of the left son of the
tree and RSON is the representation of the right son. Then
BAL can be used to rapidly scan for an insertion point. A tree
is built up much in the manner of INSERT. Rewrite SSORT so
that the string returned is this tree.


Exercise 13.9  The body of SSORT (Prog. 13.7) need only be one
statement. Modify the pattern SS_PAT so that the :S(RETURN)
can be changed to :(RETURN) and the second statement deleted
entirely.


Exercise 13.10  One can enhance the speed of INSERT by
periodically balancing the tree. Write a function TREEBAL(N)
which will balance a tree beginning at node N and return the
root of the balanced tree. The use of LINEARIZE to write this
function is optional.


Exercise 13.11  Modify LINEARIZE so that the LSON fields are
cleared.


Exercise 13.12  Modify LINEARIZE so that it counts the number
of nodes in the tree. Assume some global variable exists (say
N) which is initially 0.


Exercise 13.13  The average number of comparisons of a
logarithmic insertion sort was estimated in the text to be
log₂ n!. This average would be achievable by INSERT only if the
tree is always kept perfectly balanced. But for random data
this will not be the case and the expected degree of unbalance
can be computed.

a) Determine the average number of comparisons required by
the tree-insertion sort. Assume that every input permutation
is equally likely and that no two items are identical.

b) As n approaches infinity, what is the ratio between this
number and n log₂ n?


Exercise 13.14  What does the tree resemble when the following
strings are placed into a) INSERT and b) INSERTB?


      A QUICK BROWN FOX JUMPED OVER THE LAZY DOG


The function definition facility in SNOBOL4 is somewhat
unorthodox. In conventional languages, a function (or its
equivalent) is defined at compile time. Thus, its entry point,
number and type of arguments, temporaries, etc. are fixed for
the duration of the program. In SNOBOL4, these are governed by
arguments to the DEFINE function. Since these arguments can be
the product of an arbitrary computation, and since the DEFINE
function can be called at any time, the function-defining
facility is extraordinarily flexible. This section shows
several examples of how this flexibility can be harnessed to
produce more efficient, better structured and more powerful
programs.


Program 14.1  DEXP

DEXP(proto) permits functions to be easily defined in terms of
simple, one-line expressions. For example:


      DEXP('AVE(X,Y) = (X + Y) / 2.0')


will define the function AVE(X,Y) to be equal to half the sum
of X and Y. It thus mimics the Fortran arithmetic function
facility. It is, however, much more powerful, since any
sequence of statements separated by semicolons may be used to
specify a function. In fact, arbitrary functions may be
defined in this way.


         DEFINE('DEXP(PROTO)NAME,ARGS')            :(DEXP_END)
*
* Entry point: First remove leading blanks, just in case.
* Next obtain the name of the new function (NAME) and its
* argument list (ARGS), removing the latter.
*
DEXP     PROTO POS(0) SPAN(' ') =
         PROTO BREAK('(') . NAME BAL . ARGS = NAME
*
* Create code which will be the body of the new function.
* Then DEFINE it.
*
         CODE(NAME ' ' PROTO ' :S(RETURN)F(FRETURN)')
         DEFINE(NAME ARGS)                         :(RETURN)
DEXP_END
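
The idea of building a function at run time from a one-line prototype carries over to other dynamic languages. The following Python sketch is offered only as an analogue of DEXP: the name dexp, the explicit namespace argument and the restriction to a single expression (rather than a sequence of statements) are assumptions of the sketch.

```python
import re


def dexp(proto, namespace):
    """Define a function from a string such as 'ave(x, y) = (x + y) / 2.0'."""
    m = re.match(r"\s*(\w+)\s*\(([^)]*)\)\s*=\s*(.+)$", proto)
    if m is None:
        raise ValueError("expected the form name(args) = expression")
    name, args, body = m.groups()
    # Compile the body at run time. eval is tolerable here only because
    # the prototype strings are written by the programmer, not read as data.
    namespace[name] = eval(f"lambda {args}: {body}", namespace)
    return namespace[name]
```

A function defined this way is visible to later definitions made in the same namespace, so one-liners may be built on top of one another.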
Epilogue


Care must be taken in the use of DEXP. If the last statement
of a sequence fails, the entire function might inadvertently
fail. This can be cured by placing a semicolon after the last
statement (null statements always succeed). For example, we
can define SIGN(X), which returns +1 if X > 0, -1 if X < 0
and null if X = 0, as:


      DEXP('SIGN(X) = GT(X,0) 1 ; SIGN = LT(X,0) -1 ;')

Program 14.2  DEXTERN

One of the most frequent requests that SNOBOL4 users make is
for more space. If lack of main storage is due to the size of
the program, then this next function, or some variant of it,
can be used to obtain more core. The function DEXTERN (Define
EXTERNal function) will allow for the dynamic loading of
SNOBOL4-coded functions. The arguments to DEXTERN(proto,label)
are identical to those of the built-in DEFINE function.
DEXTERN will create a small provisional function body for each
such function. This will cause the first call on that function
to result in the function being loaded from an external file,
compiled and executed. Subsequent calls go straight into
execution with no overhead.


         DEFINE('DEXTERN(PROTO,LBL)NAME')
         DEFINE('LOADEX(LBL)PAT,X,CODE')
         LIB. = Some Library File Designator       :(DEXTERN_END)
*
* Entry point for DEXTERN. Determine the label (LBL) and
* compile code which serves as the function body until the
* first call. Then define the function.
*
DEXTERN  PROTO IDENT(LBL) BREAK('(') . LBL
         CODE(LBL " LOADEX('" LBL "') ; :(" LBL ")")
         DEFINE(PROTO,LBL)                         :(RETURN)
*
* Entry point for LOADEX(LBL). LOADEX will load an external
* segment of code beginning with label LBL and ending with
* LBL_END.
*
LOADEX   REWIND(LIB.)
         INPUT(.LIB_FILE,LIB.)
*
* Loop to look for function
*
         PAT = POS(0) LBL (' ' | RPOS(0))
LOADEX_1 CODE = LIB_FILE                           :F(ERROR)
         CODE PAT                                  :F(LOADEX_1)
*
* Loop to process statements. Note conventional continuation
* and comment characters.
*
         PAT = POS(0) LBL '_END' (' ' | RPOS(0))
LOADEX_2 X = LIB_FILE                              :F(LOADEX_3)
         X PAT                                     :S(LOADEX_3)
         X POS(0) ANY('*-')                        :S(LOADEX_2)
         X = ';' X
         X POS(0) ';' ANY('.+') = ' '
         CODE = CODE X                             :(LOADEX_2)
*
* Now code it up and return.
*
LOADEX_3 CODE(CODE)                                :(RETURN)
DEXTERN_END


Epilogue 


One reason for the DEXTERN function is convenience. 
Frequently-used subroutines need not be copied into a given 
program but may be kept in a file which serves as a library. 
In this way several programs may share a common library and 
may be assured of up-to-date copies. 


Another reason for DEXTERN is that it permits the running of 
many large programs which would otherwise not fit into core. 
Most large programs have significant portions that are  infre- 
quently used and it is extremely rare to encounter an applica- 
tion which requires all the facilities of the large program. 


The text processing system used to write this book is a good 
example of this. There are approximately 1200 statements in 
the main program and approximately 1500 in an external 
library. Each chapter of the book may be processed within 
prime-shift limits since no chapter uses all the facilities of 
the text processor. However, the entire book requires an 
evening run. 


It is not necessary to dynamically load source programs on a 
per-function basis. See Exercise 14.5. 
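

The provisional-body technique is not peculiar to SNOBOL4. As a rough Python analogue (a sketch, with the dictionary LIBRARY standing in for the external library file and all names hypothetical), a stub can compile the real source on its first call and rebind the name so that later calls bypass the stub entirely:

```python
LIBRARY = {    # hypothetical stand-in for the external library file
    "square": "def square(x):\n    return x * x\n",
}
load_count = {}    # how many times each function has been loaded


def dextern(name, namespace):
    """Install a provisional body that loads the real one on first call."""
    def provisional(*args):
        load_count[name] = load_count.get(name, 0) + 1
        exec(LIBRARY[name], namespace)       # compile; this rebinds `name`
        return namespace[name](*args)        # then run the real function
    namespace[name] = provisional
```

After the first call the name refers to the compiled function itself, so the loading cost is paid once at most and not at all for functions that are never called.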


Program 14.3  FTRACE

One advantage of decomposing a large program into functions is
that the values passed to a function and the value returned can
be easily monitored by means of the &FTRACE switch.
Unfortunately, only strings, reals and integers are printed
explicitly. Other data objects such as patterns, arrays,
tables, etc. result in only the datatype being printed (with
possibly an identification number as in SITBOL). This
deficiency can be corrected by the programmer, however, by
using the available trace facilities. In particular


      TRACE(NAME, 'CALL', , FNAME)


will cause the function named FNAME to be invoked when the
function named NAME is called. FNAME can determine sufficient
information about the called function (such as its arguments
via the ARGS function) to produce an elaborate display of any
aggregate passed as argument. The second argument to TRACE
can be the string 'RETURN' which can enable a similar function
to display the returned value.


One weakness of the scheme is that unlike the &FTRACE switch
which affects all function calls, the TRACE function requires
two explicit calls for each function traced. The FTRACE
function defined here is designed to automate this process. It
is simply placed once in the program before all functions which
are to be traced. FTRACE will redefine the DEFINE function
and thereby seize control at each function definition. The
functions actually called to do the tracing (FTR_CALL and
FTR_RET) are left as exercises.


         DEFINE('FTRACE(PROTO,LABEL)NAME')
         OPSYN('DEFINE.','DEFINE')
         OPSYN('DEFINE','FTRACE')
         &TRACE = 10000                            :(FTRACE_END)
*
* Entry point: Define the function, issue the trace requests
* and return.
*
FTRACE   DEFINE.(PROTO,LABEL)
         PROTO BREAK('(') . NAME
         TRACE(NAME,'CALL',,'FTR_CALL')
         TRACE(NAME,'RETURN',,'FTR_RET')           :(RETURN)
FTRACE_END
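
The trick of seizing control at each definition can be imitated wherever the defining routine is itself an ordinary, rebindable name. The Python sketch below is an analogue only: the registry functions, the list trace_log and the definer names are hypothetical, and the call and return hooks simply append to a log where FTR_CALL and FTR_RET would print a display.

```python
trace_log = []
functions = {}


def define(name, fn):
    """The plain definer: just register the function."""
    functions[name] = fn


define_ = define           # keep the original reachable, like DEFINE.


def ftrace(name, fn):
    """Replacement definer: register fn wrapped in call/return hooks."""
    def traced(*args):
        trace_log.append(("CALL", name, args))
        result = fn(*args)
        trace_log.append(("RETURN", name, result))
        return result
    define_(name, traced)


define = ftrace            # every later definition is traced
define("ave", lambda x, y: (x + y) / 2.0)
```

Functions defined before the rebinding are untouched, which mirrors the rule that FTRACE is placed before the functions to be traced.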


Program 14.4  INSULATE

This routine can protect other routines from possible
malfunction owing to an unanticipated modification of some
global variable or keyword. As written, protection from
modification of the &ANCHOR keyword is obtained, but this
protection could be extended to include other keywords and
global variables as well.


While it is held in these pages that modification of the 
&ANCHOR keyword is seldom warranted and is inconsistent with a 
general functional scheme of decomposing and structuring a 
large program, it is nonetheless true that occasionally one 
encounters two separately written sections of code that in- 
teract with each other and that depend on opposite values for 
the &ANCHOR keyword. For example, if routines in this book 
were called from a main program which assumed anchored mode, 
then pandemonium would be the general result. 


To rectify the situation short of recoding one or the other of
the two ill-fitting sections one may insert the INSULATE
function.


*
* INSULATE will cause each function following it to trap to
* INS_CALL() when called and to INS_RET() on return. This
* requires redefining DEFINE to point to INSULATE.
*
         DEFINE('INSULATE(PROTO,LABEL)NAME')
         DEFINE('INS_CALL()')
         DEFINE('INS_RET()')
         OPSYN('DEFINE.','DEFINE')
         OPSYN('DEFINE','INSULATE')
         &TRACE = 100000                           :(INSULATE_END)
*
* Entry point for INSULATE. Define the function and set up
* tracing.
*
INSULATE PROTO BREAK('(') . NAME
         DEFINE.(PROTO,LABEL)
         TRACE(NAME,'CALL',,'INS_CALL')
         TRACE(NAME,'RETURN',,'INS_RET')           :(RETURN)
*
* The two routines.
*
INS_CALL PUSH(&ANCHOR) ; &ANCHOR = 0               :(RETURN)
INS_RET  &ANCHOR = POP()                           :(RETURN)
INSULATE_END


Names referenced    Name    Type        Where defined
by INSULATE:        PUSH    Function    Program 5.5
                    POP     Function    Program 5.6
Epilogue


Note that when a routine is called and INS_CALL gains control
it calls the routine PUSH(). If tracing were on at this point,
PUSH would presumably be traced, sending control to INS_CALL
again; an infinite loop would be the sad result. But the
&TRACE switch is conveniently turned off at this point and
restored on return. As Dickman and Jensen (the original
implementors of the SNOBOL4 trace facility) put it, the 'stout
of heart' can turn tracing on after the function receives
control.


Program 14.5  REDEFINE

SNOBOL4 has the ability to redefine built-in operators and
functions. Thus we may write


      OPSYN('+', '*', 2)


indicating that the binary operator '+' is made equivalent to
binary '*'. All additions thereafter become multiplications.
OPSYN can be used for named functions as well as operators and
user-defined functions as well as built-ins.


While the basic facility exists, we are here concerned with
its proper and effective use as a programming tool.
Undoubtedly it has already occurred to the reader that he can
play 'fool the counselor' with an OPSYN as above. Let us
assume, however, that we are above such pranks. A
semi-legitimate use of redefining an existing facility is as
follows. Being unfamiliar with the language, and in particular
unaware of the built-in function REPLACE, a programmer writes a
user-defined function REPLACE as part of a larger program.
Subsequently he learns of this built-in facility and wants to
use it. He may write


      OPSYN('REP', 'REPLACE')


before defining REPLACE and use REP() to obtain the built-in
facility.


This use is only semi-legitimate for if the program is to have 
a long life, he would be better off redefining his original 
function, even if more painful, than in redefining a built-in. 


Redefining a built-in is normally only justifiable as a design 
objective if one is writing a facility designed to be upward 
compatible with an existing one. For example, one may redefine 
the operator '+' to sum arrays, complex numbers or physical 
quantities but in that case it should treat conventional ob- 
jects (integers, reals, strings) as it did prior to the 
redefinition. 


REDEFINE(OP,PROTO,LABEL) is intended to make such upward
compatible extensions. The first argument is an operator to be
redefined or, if a function is being redefined, the first
argument is null. The name of this function can be taken from
the second argument which is the function prototype normally
given to DEFINE.


         DEFINE('REDEFINE(OP,DEF,LBL)NAME,N,FLAG') :(REDEFINE_END)
*
* Entry point: Extract the function's name (NAME) and deter-
* mine the number of arguments (N = 1 or 2).
*
REDEFINE DEF BREAK('(') . NAME '(' BREAK('),') LEN(1) . FLAG
         N = 1
         N = IDENT(FLAG,',') 2
*
* But if the first argument is null, we are not talking
* about an operator (OP) at all but a named function.
*
         N = IDENT(OP)
         OP = IDENT(OP) NAME
         OPSYN(NAME '.',OP,N)
         DEFINE(DEF,LBL)
         OPSYN(OP,NAME,N)                          :(RETURN)
REDEFINE_END


Epilogue


In order to avoid defining away the built-in facility
irretrievably, REDEFINE will OPSYN to it a created name formed
by appending a period to the function's name. For example,


      REDEFINE('+', 'SUM(X,Y)')


will cause SUM.() to be defined and equivalenced to the old
binary + while binary + will now be equivalenced to SUM().
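

The bookkeeping performed by REDEFINE can be pictured with a small operator table. The Python sketch below is an analogue under stated assumptions: the table ops, the function redefine and the example extension vsum are hypothetical, and Python's own + is of course not affected.

```python
import operator

ops = {"+": operator.add}      # hypothetical operator table


def redefine(op, name, fn):
    """Bind op to fn, keeping the old meaning of op under name + '.'."""
    ops[name + "."] = ops[op]
    ops[name] = fn
    ops[op] = fn


def vsum(x, y):
    """Add lists element by element; defer to the old '+' otherwise."""
    if isinstance(x, list) and isinstance(y, list):
        return [a + b for a, b in zip(x, y)]
    return ops["sum."](x, y)


redefine("+", "sum", vsum)
```

The extension is upward compatible in the sense of the text: conventional operands still reach the original operation through the dotted name.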


REDEFINE can substantially simplify the task of extending a
range of built-in operators. This is best illustrated by
example as in the next program.


Program 14.6  PHYSICAL

To illustrate the redefinition facility and to create a
possibly useful extension to SNOBOL4 we will define the four
fundamental operators of arithmetic to operate on 'physical'
quantities. For example, a quantity such as four meters divided
by a quantity such as two seconds produces a speed of two
meters-per-second. Normally, physical quantities are
represented by some combination of units of length, mass, time
and charge. We will illustrate our system with the
near-standard MKS system (Meters-Kilograms-Seconds-Coulombs)
but it should be obvious that any other system can be employed.
Indeed, the subroutines, as written, depend in no way on our
particular universe; any type and number of physical quantities
may be employed (up to the size of &ALPHABET).


Physical quantities will be represented by a programmer-defined
datatype defined as


      DATA('PHYS(VAL,NUM,DEN)')


where VAL is the numerical value, NUM is the numerator of the
units field and DEN is the denominator. Units are represented
by single letters. For example, 3.5 meters/second² may be
represented as:


      PHYS(3.5, 'M', 'SS')


         DATA('PHYS(VAL,NUM,DEN)')
*
* The following operators and one function are redefined.
*
         REDEFINE('-','MINUS(X)')
         REDEFINE('+','SUM(X,Y)')
         REDEFINE('-','DIFF(X,Y)')
         REDEFINE('*','MULT(X,Y)')
         REDEFINE('/','DIV(X,Y)')
         REDEFINE(,'EQ(X,Y)')
*
* NORM(X) will normalize a physical quantity, meaning that
* we obtain a unique specification for comparison purposes.
* This is done by sorting the physical units and canceling
* common factors across the division bar.
*
         DEFINE('NORM(X)C')                        :(NORM_END)
NORM     X = DIFFER(DATATYPE(X),'PHYS') PHYS(X)
         NORM = X
         DEN(X) = ORDER(DEN(X))
         NUM(X) = ORDER(NUM(X))
NORM_1   IDENT(DEN(X))                             :S(RETURN)
         NUM(X) ANY(DEN(X)) . C =                  :F(RETURN)
         DEN(X) C =                                :(NORM_1)
NORM_END
*
* XY() will normalize the two arguments of an arithmetic
* operation (assumed to be X and Y). As an added bonus, XY()
* will succeed if neither argument is a physical quantity
* (the old operation can be applied).
*
         DEFINE('XY()')                            :(XY_END)
XY       (DIFFER(DATATYPE(X),'PHYS')
+        DIFFER(DATATYPE(Y),'PHYS'))               :S(RETURN)
         X = NORM(X) ; Y = NORM(Y)                 :(FRETURN)
XY_END                                             :(PHYSICAL_END)
*
* The definitions of the separate functions are now greatly
* simplified because of the utilities written above.
*
MINUS    MINUS = XY() MINUS.(X)                    :S(RETURN)
         MINUS = PHYS(-VAL(X),NUM(X),DEN(X))       :(RETURN)
SUM      SUM = XY() SUM.(X,Y)                      :S(RETURN)
         SUM = PHYS(VAL(X) + VAL(Y),NUM(X),DEN(X)) :(RETURN)
DIFF     DIFF = X + -Y                             :(RETURN)
MULT     MULT = XY() MULT.(X,Y)                    :S(RETURN)
         MULT = PHYS(VAL(X) * VAL(Y),NUM(X) NUM(Y),
+        DEN(X) DEN(Y))                            :(RETURN)
DIV      DIV = XY() DIV.(X,Y)                      :S(RETURN)
         DIV = PHYS(VAL(X) / VAL(Y),NUM(X) DEN(Y),
+        DEN(X) NUM(Y))                            :(RETURN)
EQ       XY()                                      :F(EQ_1)
         EQ.(X,Y)                                  :S(RETURN)F(FRETURN)
EQ_1     (EQ(VAL(X),VAL(Y)) IDENT(NUM(X),NUM(Y))
+        IDENT(DEN(X),DEN(Y)))                     :S(RETURN)F(FRETURN)
PHYSICAL_END


Names referenced    Name        Type        Where defined
by PHYSICAL:        REDEFINE *  Function    Program 14.5
                    ORDER       Function    Program 3.1


* indicates name is referenced in the initialization section.


Epilogue


As an example of the use of physical arithmetic, we may
assign:


      MET. = PHYS(1, 'M')
      SEC. = PHYS(1, 'S')
      KG.  = PHYS(1, 'K')


and from now on we need not so much as employ the PHYS()
functional form as it will be called implicitly. Thus a Newton
is a Kg.-Met./Sec.² so we write:


      NEWT. = (KG. * MET.) / (SEC. * SEC.)


and a Joule is a Newton-Meter:


      JL. = NEWT. * MET.


Though we are using an MKS system as a base for our physical
quantities, we can specify any given problem and perform all
calculations in thoroughly colloquial units. For example, we
can express foot, mile and acre as:


      IN.  = MET. / 39.4
      FT.  = 12 * IN.
      MI.  = 5280 * FT.
      ACRE = (MI. * MI.) / 640


We may then express computations entirely in the new units.
For example, to print the acreage of a plot of ground 200' by
250' we write:


      OUTPUT = VAL(200 * FT. * 250 * FT. / ACRE) ' ACRES'


We may even dispense with the asterisk between 200 and FT. but
this is left as an exercise.
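

In a language with operator overloading the same arithmetic can be had without redefining built-ins. The Python class below is a sketch of the idea, not a translation of PHYSICAL: the class name Phys and the helper of are hypothetical, the units are normalized at construction time rather than by a separate NORM, and the unit check in addition is something the sketch adds.

```python
class Phys:
    """A value with units; num and den are strings of one-letter units."""

    def __init__(self, val, num="", den=""):
        num, den = sorted(num), sorted(den)
        for unit in list(num):              # cancel common factors
            if unit in den:
                num.remove(unit)
                den.remove(unit)
        self.val, self.num, self.den = val, "".join(num), "".join(den)

    @staticmethod
    def of(x):
        """Promote a plain number to a dimensionless quantity."""
        return x if isinstance(x, Phys) else Phys(x)

    def __neg__(self):
        return Phys(-self.val, self.num, self.den)

    def __add__(self, other):
        other = Phys.of(other)
        if (self.num, self.den) != (other.num, other.den):
            raise ValueError("unit mismatch")   # a check this sketch adds
        return Phys(self.val + other.val, self.num, self.den)

    def __sub__(self, other):
        return self + (-Phys.of(other))

    def __mul__(self, other):
        other = Phys.of(other)
        return Phys(self.val * other.val,
                    self.num + other.num, self.den + other.den)

    __rmul__ = __mul__                      # so that 12 * IN works

    def __truediv__(self, other):
        other = Phys.of(other)
        return Phys(self.val / other.val,
                    self.num + other.den, self.den + other.num)

    def __eq__(self, other):
        other = Phys.of(other)
        return ((self.val, self.num, self.den)
                == (other.val, other.num, other.den))
```

With MET = Phys(1, 'M') and the foot, mile and acre built up as in the text, the quotient 200 * FT * 250 * FT / ACRE comes out dimensionless, which is a useful check that the units were combined correctly.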


Co-routines and state functions

The notion of co-routine is of interest from several
standpoints. In theoretical circles, it is as worshiped a
programming practice as the goto is deplored. However, this
theoretical enthusiasm does not carry over to the practical
world. Practical programmers shun co-routines to a greater
extent than they embrace goto's. Nonetheless, techniques for
the construction of well-formed programs are not very well
developed nor understood at this writing and study of the
co-routine protocol is warranted merely for the light it can
shed on this other, more general, issue.


As remarked by Knuth [Vol. 1, p. 191], small examples of
co-routines do not seem to exist and so we must construct a
somewhat elaborate situation merely to demonstrate what one is.
The best example seems to be one furnished by a compiler. As
we have discussed previously (Chapter 11), a compiler is
frequently decomposed into lexical analysis and syntactic
analysis. The purpose of lexical analysis is to decompose a
string into a sequence of discrete non-decomposable objects
frequently represented by pointers into a symbol table. Thus,
the portion of SNOBOL4 program:


(ALPHA + BETA GAMMA) 


will be analyzed by the lexical analyzer into seven components,
i.e., left parenthesis, ALPHA, binary plus, BETA, binary blank,
GAMMA and right parenthesis. It may be seen from this example
that the output of the lexical analyzer is not determined
completely from the characters which appear before it on the
input stream but is also based on characters which have
previously been processed. Thus, if the last token passed back
had been a binary operator, then a blank preceding an
identifier (such as BETA) is ignored, but if the last token
had been an identifier (or constant, right parenthesis, etc.)
then the blank preceding another identifier is interpreted as
an operator.


The lexical analyzer can most naturally be described by state 
transitions. For example, after having processed a left paren- 
thesis, the lexical analyzer is in the same state as after it 
has processed a binary operator. Also, after having processed 
a right parenthesis it is in the same state it is in when it 
has processed an identifier. Though this simple example only 
depicts two such states there are in fact several others. 


States are most naturally represented by a location within the 
program which is currently being executed. Now this presents 
an anomaly if, as frequently happens, the syntactic analyzer 
calls the lexical analyzer for each token. This is because
called functions do not normally 'remember' their state but
rather begin each computation afresh from some fixed entry
point.


We may at this point wonder if we had not got things backward. 
Maybe the lexical analyzer should call the syntactic analyzer 
each time it wants to dispose of one of its tokens. But then 
the shoe is on the other foot. The state of the syntactic 
analyzer is also best recorded by means of a location. 


This dilemma is resolved by a co-routine linkage. The jump- 
and-set-link instruction, common in most machines, can jump to 
a location and simultaneously set a register to the current 
location. By means of this instruction the lexical analyzer, 
when it wishes to return to the syntactic analyzer, can jump 
to a common return point which can save the contents of this 
register and use this as the start up point when the lexical 
analyzer is reentered. From the point of view of the lexical 
analyzer, it is like calling the syntactic analyzer. Actually, 
a little section of code is needed to make it seem as though 
each is calling the other in an entirely symmetric way. 
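

Some later languages provide this kind of linkage directly. As an illustration only, the Python generator below behaves like the lexical analyzer of the example: each time the caller asks for a token, execution resumes where it last left off, so the state is recorded by the location within the function. The token rules are a deliberate simplification (identifiers, parentheses, the four arithmetic operators and the binary blank) and the name lexer is hypothetical.

```python
def lexer(text):
    """Yield tokens one at a time; the suspended position is the state."""
    i = 0
    while True:
        # State A: an operand is expected (at the start, after '(' or
        # after an operator). Blanks met here are insignificant.
        while i < len(text) and text[i] == " ":
            i += 1
        if i == len(text):
            return
        if text[i] == "(":
            i += 1
            yield "("
            continue
        j = i
        while j < len(text) and text[j].isalnum():
            j += 1
        if j == i:
            raise ValueError(f"unexpected {text[i]!r} at {i}")
        yield text[i:j]
        i = j
        # State B: an operand (identifier or ')') has just been passed.
        while True:
            blanks = 0
            while i < len(text) and text[i] == " ":
                i += 1
                blanks += 1
            if i == len(text):
                return
            if text[i] == ")":
                i += 1
                yield ")"
                continue                 # still in state B
            if text[i] in "+-*/":
                yield text[i]
                i += 1
            elif blanks:
                yield "binary blank"     # blanks here mean concatenation
            else:
                raise ValueError(f"unexpected {text[i]!r} at {i}")
            break                        # back to state A
```

A syntactic analyzer would obtain tokens one at a time with next(tokens), where tokens = lexer(source); neither routine has to return to a fixed entry point between tokens.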


We may at this point step back and wonder why the need for co- 
routines is not felt more frequently than it is. Certainly it 
cannot be the inappropriateness of modeling computational 
behavior by state transitions as this is very common. The 
answer must lie in the fact that few functions require shifts 
in entry point to operate effectively. A shift in entry point 
implies that the next computation will depend on the ones 
which went before; that is, the function is non-homomorphic.*


Non-homomorphic transformations are frequently homomorphic if
the units are made large enough. Thus, lexical analysis, when
considered on a token basis, is non-homomorphic but is
homomorphic on a per-statement basis. This is, in fact, one
of the advantages of a string language (or a list language).
Entire sequences may be ported across functional boundaries
which may then be aligned with the natural decomposition of a
problem into homomorphic transformations.


*Recall from Chapter 3 that a homomorphic string transformation
T is one such that T(S1 S2) = T(S1) T(S2).


Such decompositions alone, however, are not sufficient, neces- 
sarily, to reduce the complexity of large practical problems 
simply because the natural homomorphic transformation may be 
considerably complex (as is the case with a compiler). This, 
incidentally, is why simple co-routine examples don't exist. 
Simple examples tend to be homomorphic or at least expressible 
as simple homomorphic transformations. 


As stated above, the conventional co-routine protocol requires
a jump-and-set-link instruction. No such facility exists in
SNOBOL4 nor can one be programmed. The main reason for this
is that in order for a statement to be pointed to, it must
have a label; the 'pointer' is a string (identical to the
label) and goto's are permitted by indirection (unary $). The
&STNO and &LASTNO keywords provide statement numbers which
could be quite useful in this regard except for the fact that
these numbers are entirely descriptive. No mechanism exists
for going to a statement with some given number.

In any event, it is not clear that a direct translation from
assembly language is the form most useful to the SNOBOL4
programmer. It is, in fact, more likely that we would want
something closer to the normal function mechanism in which
arguments are passed, values returned and temporaries saved.
This is provided by the state function.


Program 14.7  STATEF

A state function is one whose next entry point (its state) is
determined by the return. In particular, in our rendition, if
the next entry point is to be label ENTRY_2, then the goto
should take the form

          :(RET('ENTRY_2'))

Returning from a state function is done only by calling
RET(label).


*  A state function is defined by a call to STATEF. It must
*  not execute a RETURN but must pass control back via a call
*  to RET(NEXT) where NEXT is the next entry point.
*
          DEFINE('STATEF(PROTO,LBL)NLBL')
          DEFINE('RET(NEXT)NAME')                     :(STATEF_END)
*
*  Entry point for STATEF. Determine the nominal entry point
*  (LBL) for the state function. Then create a new label
*  (NLBL) which will serve as the real entry point for the
*  function.
*
STATEF    PROTO  IDENT(LBL)  BREAK('(') . LBL
          NLBL  =  LBL '_ENTRY'
          DEFINE(PROTO,NLBL)
*
*  At this entry point we push our name so that upon return
*  we know what function we were in.
*
          CODE(NLBL " PUSH('" NLBL "') :($" NLBL ")")
          $NLBL  =  LBL                               :(RETURN)
*
*  Entry point for RET: Get the name pushed on entry. Assign
*  our argument (NEXT) to this name so that we know where to
*  come back to next time. Then indicate a return.
*
RET       NAME  =  POP()
          $NAME  =  NEXT
          RET  =  .RETURN                             :(NRETURN)
STATEF_END


Names referenced   Name    Type       Where defined
by STATEF:         PUSH    Function   Program 5.5
                   POP     Function   Program 5.6

Epilogue


An example of the use of STATEF is given in Exercise 14.18. 
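
For readers who want to experiment with the idea outside SNOBOL4, the following is a rough Python analogue, offered only as a sketch. The class `StateFunction` and the helper `make_toggle` are hypothetical names; where STATEF records the next entry point in a variable named after the function, the sketch keeps it in an attribute.

```python
class StateFunction:
    """Callable whose entry point for the next call is chosen by the
    return of the current call."""

    def __init__(self, entries, start):
        self.entries = entries    # label -> function(arg) -> (value, next label)
        self.state = start        # label of the current entry point

    def __call__(self, arg):
        value, self.state = self.entries[self.state](arg)
        return value


def make_toggle():
    def even(x):
        return ("even:" + x, "ODD")     # plays the role of :(RET('ODD'))

    def odd(x):
        return ("odd:" + x, "EVEN")     # plays the role of :(RET('EVEN'))

    return StateFunction({"EVEN": even, "ODD": odd}, "EVEN")
```

Successive calls of the object returned by `make_toggle()` enter alternately at EVEN and ODD, each entry having been selected by the previous return.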


Program 14.8  STACK

The functions PUSH, POP and TOP (Progs. 5.5, 5.6 and 5.7) are
fine if you only need one stack. What should one do if one
requires more than one stack? We could provide an optional
second argument to designate which of several stacks is
intended. For example, PUSH(V,N) could push an item V onto a
stack designated by N. The principal disadvantage of this
approach is that it produces code which lacks clarity. Another
disadvantage is that an extra instruction must be executed in
a rather simple function, resulting in inefficiencies. To
correct these deficiencies, we will incorporate the name of the
stack into the name of the function. For example, PUSHA(V)
will push onto stack A the value V. In general any string may
take the place of 'A' as a stack designator.


To automate the process of creating the stack functions, we
will write a function STACK(suffix). STACK will define three
stack-manipulation functions, POPsuffix, PUSHsuffix, and
TOPsuffix. For example, STACK('A') will define the three
functions, PUSHA(V), POPA() and TOPA().


          DEFINE('STACK(SUF)S')
          DATA('LINK(VALUE,NEXT)')                    :(STACK_END)
*
*  Entry point: Assign to S a long string equal to the code
*  we have to create except that the string 'SUF' is used
*  where the suffix will eventually be placed.
*
STACK     S  =
+           'PUSHSUF STACK_SUF = LINK(V,STACK_SUF) ;'
+           ' PUSHSUF = .VALUE(STACK_SUF) :(NRETURN) ;'
+           'POPSUF IDENT(STACK_SUF) :S(FRETURN) ;'
+           ' POPSUF = VALUE(STACK_SUF) ;'
+           ' STACK_SUF = NEXT(STACK_SUF) :(RETURN) ;'
+           'TOPSUF IDENT(STACK_SUF) :S(FRETURN) ;'
+           ' TOPSUF = .VALUE(STACK_SUF) :(NRETURN) ;'
*
*  Now we create the required code and define functions.
*
          CODE(REPL(S,'SUF',SUF))
          DEFINE('PUSH' SUF '(V)')
          DEFINE('POP' SUF '()')
          DEFINE('TOP' SUF '()')                      :(RETURN)
STACK_END


Names referenced   Name    Type       Where defined
by STACK:          REPL    Function   Program 3.15

Epilogue


Note the use of the REPL function to create code. It is
possible to avoid the use of REPL by a judicious concatenation
of string constants and variables (try it) but it is
impossible to avoid going mad in the process.
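
The same naming scheme can be sketched in Python. This is an illustration only, and it swaps in a different technique: where STACK compiles a code template with CODE, the sketch uses closures and installs them in a dictionary. The function name `stack` and the `namespace` argument are assumptions of the example.

```python
def stack(suffix, namespace):
    """Install PUSH<suffix>, POP<suffix> and TOP<suffix> in `namespace`;
    the three functions share one private stack."""
    items = []

    def push(v):
        items.append(v)
        return v

    def pop():
        if not items:
            raise IndexError("stack " + suffix + " is empty")
        return items.pop()

    def top():
        if not items:
            raise IndexError("stack " + suffix + " is empty")
        return items[-1]

    namespace["PUSH" + suffix] = push
    namespace["POP" + suffix] = pop
    namespace["TOP" + suffix] = top
```

Calling `stack('A', ns)` and `stack('B', ns)` yields two independent stacks whose functions are distinguished only by name, which is the effect sought by Program 14.8. An empty stack raises an exception here, where the SNOBOL4 version fails.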
Exercise 14.1  If we attempted to define MAX(X,Y) by means of:

          DEXP('MAX(X,Y) = X ; MAX = GT(Y,X) Y')

we would experience a difficulty. (a) What is it? (b) What
simple change in this call will correct things?


Exercise 14.2  Modify DEXP (Prog. 14.1) so that identifiers
following the argument list are regarded as function
temporaries (requires modifying one statement).


Exercise 14.3  The encoding of LOADEX (in Prog. 14.2) assumes
no syntax error in the external code. (a) Modify LOADEX so
that if the external code contains a syntax error it will print
out the code and establish a function body which will always
fail.


Exercise 14.4  Rewrite DEXTERN so that it operates by tracing.
That is, on first call of the indicated function, a routine is
called which loads the function (you may use LOADEX to simplify
matters). Be sure to issue a STOPTR after loading the function.


Exercise 14.5  A particularly long program consists of
sections [...] of these sections are in use in any given run.
But, depending on the data, any section could be reached. Using
LOADEX, how could you replace these sections with something
smaller?


Exercise 14.6  Encode FTR_CALL and FTR_TRC to trace functions
as required by FTRACE (Prog. 14.3).


Exercise 14.7  Should the definition of FTR_CALL and FTR_RET
precede or follow the definition of FTRACE or does it not make
any difference?


Exercise 14.8  Modify INSULATE (Prog. 14.4) so that it doesn't
depend on TRACE to obtain control on calls or returns.


Exercise 14.9  How could INSULATE be used to guard against
modifications of the ARB variable?


Exercise 14.10  Define a complex number by the structure

          DATA('COMPLEX(R,I)')

where R is the real part and I is the imaginary part. With
the help of REDEFINE (Prog. 14.5) extend the binary operators
+, -, *, / and the binary functions GT, GE, LE, LT, EQ, NE to
operate on complex numbers if one or both of the arguments are
complex. To simplify things, write a generalized argument
processing function which will succeed if both arguments are
not complex and will otherwise fail, converting any non-complex
argument to complex.

Exercise 14.11  Assuming that the binary arithmetic operators
have been redefined to operate on COMPLEX quantities as in the
previous exercise, can the PHYSICAL package also be used with
the VAL field a possibly complex quantity? Said another way,
what trouble spots are there in compounding redefinitions along
the lines suggested?


Exercise 14.12  Redefine the arithmetic operators to operate on
identically-dimensioned arrays.


Exercise 14.13  Ordinarily a function such as F() cannot set
the variable F as a side effect since the value of F is saved
at the call and restored on return. Strange as it seems,
however, a technique exists to do precisely that. In
particular, it is possible that F(X) will assign the value of X
to the variable F. Define such an F.


Exercise 14.14  Generalize the previous exercise. That is,
define a function DEF(NAME) such that, for example, DEF('F')
will establish F(X) as equivalent to:

          F  =  X


Exercise 14.15  Rewrite STATEF (Prog. 14.7) such that on a
return via the call RET(LABEL) the function DEFINE is called
with LABEL the new entry point.


Exercise 14.16  In the epilogue to PHYSICAL (Prog. 14.6) we
expressed the quantity 200 FT. with an intervening asterisk
(denoting multiplication). This could have been avoided by
redefining concatenation (a purifying experience). What four
statements need be added to PHYSICAL so that concatenation as
well as multiplication form the product of physical units?
(Hints: Be cautious of a circular definition, i.e. using
concatenation to define concatenation, unless the recursion
stops. Don't worry about the various predicate uses of
concatenation since your program won't get control if one of
the items to be concatenated fails.)


Exercise 14.17  Add an FRET(NEXT) function to provide an
FRETURN facility to STATEF (Prog. 14.7).


Exercise 14.18  Draw a state transition table for a lexical
analysis of SNOBOL4 expressions (i.e., assume no labels, no
pattern matching, no goto-fields, just expressions) as follows.
For each state and each token (left parenthesis, identifier,
number, operator, etc.) direct an arrow to the next state and
indicate what, if anything, is to be returned. Implement this
as a state function.


Exercise 14.19  Write a function FUNCTION(NAME) that will
succeed returning the null string if NAME is the name of a
programmer-defined function. Otherwise it should fail. Hint:
the definition of FUNCTION should appear before every other
function. For extra credit, any name OPSYN'ed to some other
name should also be regarded as a programmer-defined function.
Even special-purpose programming languages require arithmetic.
The original SNOBOL contained the five arithmetic operators
(+, -, /, *, **) which operated only on strings (that resembled
integers) within a limited form of expression (e.g., no
parentheses). SNOBOL3 allowed more freedom (e.g., parenthetical
groupings were permitted) in forming expressions but retained
the string format for representing integers. SNOBOL4 broke with
the tradition of the single datatype and introduced both
INTEGER and REAL as separate types. Moreover, it represented
these objects internally as machine integers and reals (i.e.
floating point numbers) respectively. Hence, a study of
SNOBOL4 numbers, in contrast to previous SNOBOLs, is very
much a study of how they are represented on most machines.


Most machines for which SNOBOL4 has been implemented are 
binary machines representing integers in base-two notation. 
In every case known to the author, the negatives are represen- 
ted in two's complement form. This is the binary equivalent 
of representing, say, -2 by a number of the form 999...99998. 
Hence, the range of integers is usually 


$$[-2^{W-1},\; 2^{W-1} - 1] \tag{15.1}$$


where W is the number of bits in the field allowed for in- 
tegers. Usually, W is the word size of the machine. For 
example, on the IBM 360/370 implementation of both SNOBOL4 and 
SPITBOL, the range of integers is $[-2^{31},\, 2^{31}-1]$.
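
The range and the representation of negatives can be checked with a few lines of Python. This is a sketch only; `integer_range` and `twos_complement` are hypothetical helper names, and Python's own integers are unbounded, so the W-bit field is simulated by masking.

```python
def integer_range(w):
    """Smallest and largest integers of a w-bit two's complement field."""
    return (-(1 << (w - 1)), (1 << (w - 1)) - 1)


def twos_complement(value, w):
    """Bit pattern (as a non-negative int) that represents `value`
    in a w-bit two's complement field."""
    lo, hi = integer_range(w)
    if not lo <= value <= hi:
        raise OverflowError("value does not fit in the field")
    return value & ((1 << w) - 1)
```

With an 8-bit field, -2 is stored as 11111110, the binary counterpart of the decimal form 999...99998 mentioned above.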


The first several programs offer some examples of integer
manipulation. The last of these (INFINIP) is aimed at
overcoming the restrictions imposed by a finite word size.


Program 15.1  COMB

The function COMB(N,M) will return the number of combinations
of N things taken M at a time, usually written in 'over'
notation as shown and defined below:

$$\mathrm{COMB}(N,M) = \binom{N}{M} = \frac{N!}{(N-M)!\,M!} \tag{15.2}$$

where $N \ge M \ge 0$. By convention 0! = 1. For N < M the value
of COMB, by convention, is 0. COMB(N,M) may also be regarded
as the coefficient of X ** M in the expansion of (X + Y) ** N
and is therefore called the binomial coefficient. It is
illustrated by the easily remembered Pascal's triangle:

                      1
                    1   1
                  1   2   1
                1   3   3   1
              1   4   6   4   1


in which N corresponds to the row (starting with 0) and M
corresponds to the position within the row (starting with 0).
Note that each term may be found by adding the two elements
immediately above it. Hence we have a simple recursive method
for computing COMB(N,M). A slightly more efficient method is
used below which is based on the identity:

$$\binom{N}{M} = \frac{N}{M}\binom{N-1}{M-1} \tag{15.3}$$

provided M > 0.


*  COMB(N,M) returns the number of combinations of N things
*  taken M at a time.
*
          DEFINE('COMB(N,M)')                         :(COMB_END)
COMB      COMB  =  EQ(M,0)  1                         :S(RETURN)
          COMB  =  COMB(N - 1,M - 1) * N / M          :(RETURN)
COMB_END
Epilogue 


Note that we do not write COMB in terms of factorials as this 
may needlessly result in integer overflow during the calcula- 
tion of intermediate results. An alternative approach is to 
write COMB iteratively and is to be recommended if time is an 
issue. This is left as Exercise 15.1. A rather bizarre method 
for computing COMB relies on pattern matching. This too is 
left as an exercise. 
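
For comparison, the recursion based on identity (15.3) can be written in Python as follows. This is a sketch in a different language, with `comb` a name chosen for the example. The multiplication is performed before the division, as in the SNOBOL4 statement, so that the integer division is always exact.

```python
def comb(n, m):
    """Number of combinations of n things taken m at a time,
    using COMB(N,M) = COMB(N-1,M-1) * N / M for M > 0."""
    if m == 0:
        return 1
    return comb(n - 1, m - 1) * n // m
```

Note that for N < M the recursion eventually multiplies by zero, so the conventional value 0 is obtained without a special test.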


Program 15.2  DECOMB

We have seen several methods of representing numbers, the Roman
system, the positional number systems (BASEB and BASE10, Progs.
2.4 and 2.5) and the factorial number system (PERMUTATION,
Prog. 12.1 and its prologue). The combinatorial number system
is yet another number system where a sequence of integers can
be used to represent a presumably larger integer. Given a fixed
number n called the nome, one can represent any positive
integer K by a vector $K_n, \ldots, K_2, K_1$ such that

$$K = \binom{K_n}{n} + \cdots + \binom{K_2}{2} + \binom{K_1}{1} \tag{15.4}$$

Moreover, if we add the restriction that:

$$K_n > \cdots > K_2 > K_1 \ge 0$$

the representation is unique. The values $K_n, \ldots, K_2, K_1$
are called cogets (as opposed to digits). The combinatorial
number system can be used to find a uniformly distributed
evaluation of poker hands (POKEV, Prog. 17.6) and this relies
mainly on the fact that cogets are monotonically decreasing.


To see that the representation is unique (for a fixed nome)
note that if the cogets assume their least value ($K_1=0$,
$K_2=1$, ..., $K_n=n-1$) we obtain K=0. Next, we assert that if
the cogets assume their largest value with $K_n=M$, then K will
be incremented by exactly one if $K_n$ is increased by one (to
M+1) and all other cogets are made as low as possible. That is:

$$\binom{M+1}{n} = \binom{M}{n} + \binom{M-1}{n-1} + \cdots + \binom{M-n+1}{1} + 1$$

That this is true follows from the rule of forming Pascal's
triangle, viz.

$$\binom{M+1}{n} = \binom{M}{n} + \binom{M}{n-1} \tag{15.6}$$

The second of the two terms on the right is decomposed
according to this formula and this is continued until the '1'
is reached.


Finally note that increasing $K_1$ by 1 increases K by 1. From
these three observations, it follows that all integers are
representable and that their representation is unique.


DECOMB(S) will regard S as a sequence of cogets, i.e. a number
in the combinatorial number system, and will return its
corresponding integer value. Cogets are represented as
characters from an alphabet (COMB_ALPHA) much as we have
previously done with positional representations.


*  DECOMB(S) returns the decimal number equivalent of the
*  argument S regarded as a representation in the combinatorial
*  number system.
*
          DEFINE('DECOMB(S)T')
          COMB_ALPHA  =  '0123456789ABCDEFGHIJKLMNOP'
+                                                     :(DECOMB_END)
DECOMB    S  LEN(1) . T  =                            :F(RETURN)
          COMB_ALPHA  @K  T                           :F(FRETURN)
          DECOMB  =  DECOMB + COMB(K,SIZE(S) + 1)     :(DECOMB)
DECOMB_END


Names referenced   Name    Type       Where defined
by DECOMB:         COMB    Function   Program 15.1

Epilogue


For additional information concerning the combinatorial number
system see Lehmer [1964] or Whitehead [1973].
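
A Python rendering of the same evaluation may help in experimenting with the number system. It is a sketch only: `decomb` mirrors the logic of DECOMB but relies on Python's `math.comb` in place of Program 15.1, and it raises `ValueError` on a character outside the alphabet where DECOMB fails.

```python
from math import comb

COMB_ALPHA = "0123456789ABCDEFGHIJKLMNOP"


def decomb(s):
    """Integer value of the coget string s, most significant coget first."""
    total = 0
    for i, ch in enumerate(s):
        k = COMB_ALPHA.index(ch)        # value of this coget
        total += comb(k, len(s) - i)    # leftmost coget is paired with the nome
    return total
```

With a nome of 3, the strings 210, 310, 320, 321 and 410 evaluate to 0, 1, 2, 3 and 4 respectively, and the twenty strictly decreasing triples drawn from 0 through 5 cover 0 through 19 exactly once.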


Program 15.3  INFINIP

INFINIP is a package of infinite precision arithmetic (i.e.
integer) functions. Large integers are represented by strings
of digits and so the size of integers permitted is not quite
infinite but is limited by the maximum length of strings. This
is generally quite large so that for all intents and purposes
the precision may be regarded as infinite.


INFINIP redefines virtually all arithmetic operators to handle 
large integers in an upward compatible way. This facilitates 
their use, and makes them plug-in-able to routines that have 
already been written using conventional facilities. It also 
serves to make the algorithms themselves clearer, since they 
are written, in part, recursively. 


INFINIP has applications in addition to generating numerical 
wall-paper. For example, it can alleviate some rather severe 
restrictions encountered in base conversions (BASEB and 
BASE10, Progs. 2.4 and 2.5) and permutation generation 
(PERMUTATION, Prog. 12.1). 


Our basic operating philosophy in writing INFINIP was not 
speed. A linked-list approach would probably have been 
considerably faster. Our main goal was to produce a legible 
and flexible package that could serve (a) to produce the ef- 
fect and (b) as a kind of extended precision laboratory in 
which different algorithms could be tested. Techniques used 
to implement infinite-precision arithmetic can also be found 
in Knuth [Vol. 2], Blum [1965], and Collins [1966].


*  INFINIP - an infinite (just about) precision arithmetic
*  package. The following operators and built-in functions
*  are redefined.
*
          REDEFINE('-','MINUS(X)Y')
          REDEFINE(,'GT(X,Y)')
          REDEFINE(,'EQ(X,Y)')
          REDEFINE(,'GE(X,Y)')
          REDEFINE(,'NE(X,Y)')
          REDEFINE(,'LT(X,Y)')
          REDEFINE(,'LE(X,Y)')
          REDEFINE('-','DIFF(X,Y)')
          REDEFINE('+','SUM(X,Y)X1,X2,Y1,Y2,K')
          REDEFINE('*','MULT(X,Y)X1,X2,K')
          REDEFINE('/','DIV(X,Y)X1,X2,Y1,Y2,T,T1,T2,KX,KY')
          REDEFINE(,'REMDR(X,Y)')
*
*  Pattern definitions:
*
          SIGN_OFF  =  POS(0) '-'
          LDG_ZEROS  =  BREAK('123456789') | RTAB(1)
          NO_DIGITS  =  8
*
*  Utility functions
*
          DEFINE('SMALL()')
          DEFINE('SPLIT(NAME,PAT)')                   :(INFINIP_END)


*
*  SMALL() will succeed if X and Y are small integers, defined
*  strategically as integers whose sum or difference will not
*  cause overflow. Tactically, they are defined as numbers
*  whose number of digits does not exceed NO_DIGITS.
*
SMALL     (LE.(SIZE(X),NO_DIGITS)
+           LE.(SIZE(Y),NO_DIGITS))                   :S(RETURN)F(FRETURN)
*
*  SPLIT(NAME,PAT) will split the named string into two
*  parts, NAME1 and NAME2 (after removing leading zeros). It
*  returns the amount of the split measured from the right.
*  The split is determined by the incoming pattern (PAT); if
*  this is null the split is approximately half.
*
SPLIT     PAT  =  IDENT(PAT)  LEN(SIZE($NAME) / 2)
          $NAME  (PAT | '') . $(NAME 1)  @SPLIT  (SPAN('0') | '')
+                REM . $(NAME 2)
          SPLIT  =  SIZE($NAME) - SPLIT               :(RETURN)
*
*  Unary minus - Remember, REDEFINE establishes MINUS. as the
*  old MINUS built-in.
*
MINUS     MINUS  =  SMALL()  MINUS.(X)                :S(RETURN)
          MINUS  =  X
          MINUS  SIGN_OFF  =                          :S(RETURN)
          MINUS  =  '-' X                             :(RETURN)
*
*  The predicates - They assume integers in normal form (i.e.
*  no leading zeros).
*


GT        SMALL()                                     :F(GT_1)
          GT.(X,Y)                                    :S(RETURN)F(FRETURN)
GT_1      X  SIGN_OFF  =                              :F(GT_2)
          Y  SIGN_OFF  =                              :F(FRETURN)
          SWAP(.X,.Y)
GT_2      Y  SIGN_OFF  =                              :S(RETURN)
          LGT(LPAD(X,SIZE(Y),'0'),
+             LPAD(Y,SIZE(X),'0'))                    :S(RETURN)F(FRETURN)
EQ        SMALL()                                     :F(EQ_1)
          EQ.(X,Y)                                    :S(RETURN)F(FRETURN)
EQ_1      IDENT(X,Y)                                  :S(RETURN)F(FRETURN)
GE        ¬(¬GT(X,Y)  ¬EQ(X,Y))                       :S(RETURN)F(FRETURN)
NE        EQ(X,Y)                                     :S(FRETURN)F(RETURN)
LT        GE(X,Y)                                     :S(FRETURN)F(RETURN)
LE        GT(X,Y)                                     :S(FRETURN)F(RETURN)
*
*  DIFF(X,Y) - Let SUM(X,Y) handle it.
*
DIFF      DIFF  =  X + -Y                             :(RETURN)
*
*  SUM(X,Y) - There are essentially two cases: plus plus
*  and plus minus. We first reduce to cases.
*
SUM       SUM  =  SMALL()  SUM.(X,Y)                  :S(RETURN)
          SUM  =  LT(X,0)  -(-X + -Y)                 :S(RETURN)
          Y  SIGN_OFF  =                              :S(SUM_1)
*
*  Here is plus plus. Simply divide and conquer.
*
          (LT(X,Y)  SWAP(.X,.Y))
          K  =  SPLIT(.X)
          Y  =  Y + X2
          SPLIT(.Y,RTAB(K))
          SUM  =  (Y1 + X1)  LPAD(Y2,K,'0')           :(RETURN)
*
*  Here is plus minus. Make sure X > Y. Then add the 10's
*  complement of Y.
*
SUM_1     SUM  =  GT(Y,X)  -(Y - X)                   :S(RETURN)
          Y  =  LPAD(Y,SIZE(X),'0')
          SUM  =  X + 1 + REPLACE(Y,'0123456789','9876543210')
          SUM  '1'  LDG_ZEROS  REM . SUM              :(RETURN)
*
*  MULT(X,Y) - Multiply is fairly simply written especially
*  if we concentrate on reducing the size of one argument at
*  a time. Note that the test for small size is somewhat
*  different here.
*
MULT      MULT  =  LE(SIZE(X) + SIZE(Y),NO_DIGITS)
+                  MULT.(X,Y)                         :S(RETURN)
          MULT  =  LT(X,0)  -X * -Y                   :S(RETURN)
          MULT  =  LT(Y,0)  -(X * -Y)                 :S(RETURN)
          (GT(Y,X)  SWAP(.X,.Y))
          MULT  =  EQ(Y,0)  0                         :S(RETURN)
          K  =  SPLIT(.X)
          MULT  =  (Y * X1)  DUPL('0',K)
          MULT  =  MULT + X2 * Y                      :(RETURN)
*
*  DIV(X,Y) - First we handle negative arguments much as we
*  did with multiply. The next part, more than any other
*  section, requires some explanation. Imagine a long division
*  operation with two (rather large) digits Y1, Y2 being
*  divided into two other large digits X1, X2. The trial
*  divisor T1 (on top of the line) is multiplied by the
*  divisor Y and subtracted from the left end of X to produce
*  error term T. This term is then divided by Y to obtain a
*  final adjustment.
*
DIV       DIV  =  SMALL()  DIV.(X,Y)                  :S(RETURN)
          DIV  =  LT(X,0)  -(-X / Y)                  :S(RETURN)
          DIV  =  LT(Y,0)  -(X / -Y)                  :S(RETURN)
          DIV  =  GT(Y,X)  0                          :S(RETURN)
          KY  =  SPLIT(.Y,LEN(NO_DIGITS / 2) | REM)
          KX  =  SPLIT(.X,LEN(NO_DIGITS))
          T1  =  X1 / Y1
          T2  =  DUPL('0',KX - KY)
          T  =  X - ((T1 * Y) T2)
          DIV  =  T1 T2
          T  =  LT(T,0)  T + 1 - Y
          DIV  =  DIV + (T / Y)                       :(RETURN)
*
*  And last but not least, REMDR.
*
REMDR     REMDR  =  X - (X / Y) * Y                   :(RETURN)
INFINIP_END


Names referenced   Name       Type       Where defined
by INFINIP:        REDEFINE   Function   Program 14.5
                   SWAP       Function   Program 3.14
                   LPAD       Function   Program 3.2
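
The divide-and-conquer treatment of the plus plus case can be tried out in isolation. The following Python sketch follows the same outline for two non-negative decimal strings; it is an illustration under stated assumptions, not a transcription of SUM. The name `add_digits` is hypothetical, the split point is taken as the upper half of the length, and short operands are handed to ordinary integer arithmetic in the way SMALL() hands them to the built-in operator.

```python
NO_DIGITS = 8   # operands this short are added with ordinary arithmetic


def add_digits(x, y):
    """Sum of two non-negative decimal strings, as a decimal string."""
    if len(x) <= NO_DIGITS and len(y) <= NO_DIGITS:
        return str(int(x) + int(y))
    if len(x) < len(y):
        x, y = y, x                      # make x the longer operand
    k = len(x) - len(x) // 2             # digits in the low part of x
    x1, x2 = x[:-k], x[-k:]              # split x into high and low parts
    y = add_digits(y, x2)                # fold the low part of x into y
    y1, y2 = y[:-k], y[-k:]              # y1 is '' when y has at most k digits
    high = add_digits(y1 or "0", x1)     # any carry travels inside y1
    return (high + y2.zfill(k)).lstrip("0") or "0"
```

The carry out of the low-order part needs no special handling: it simply appears as an extra leading digit of `y`, which the second split assigns to the high-order sum.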
REALs and Mixed Mode

REALs consist of three fields, a sign bit, the exponent (or
characteristic) and the mantissa. The exponent indicates the
power to which an assumed base must be raised whereas the
mantissa represents the most significant bits of the number.
In symbols:

$$\text{NUMBER} = \text{mantissa} \times \text{base}^{\text{exponent}}$$

REALs, of course, vastly increase the range of numbers
representable at the sacrifice of precision. While the
particular details of representing floating point numbers
differ from machine to machine, there are nonetheless a few
general practices which most machine manufacturers adhere to:


The three fields of a floating point number are arranged in
their order of significance and adjusted so that comparison of
two quantities can be made using the same arithmetic
comparator as integers. This places the sign bit in the first
position, followed by the exponent and then the mantissa. To
facilitate comparisons, the exponent is represented in
so-called excess notation with the most negative exponent
represented as 00...0 and the highest as 11...1. Also, the
mantissa is normalized to produce, for any given number, a
unique exponent, again, so that the comparison can be carried
out. The mantissa is normalized by shifting it to the left
and decreasing the exponent until further shifting would
destroy information. The mantissa is generally assumed to
represent a fraction just less than 1. With a binary base, the
lead digit of the normalized number is always 1 and so
represents redundant information. It can be, and actually has
been, omitted on at least one machine (the PDP-11). By
convention, a floating point 0 is represented as an all-0 word.
On the PDP-11 it is the only bit pattern not otherwise used.
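
These conventions can be observed on a present-day machine. The Python sketch below unpacks a float into its three fields; it assumes the IEEE 754 double format used by Python floats (1 sign bit, 11 exponent bits in excess-1023 notation, 52 stored mantissa bits with the leading 1 omitted), which differs in detail from the machines discussed in the text. The names `bit_pattern` and `fields` are hypothetical.

```python
import struct


def bit_pattern(x):
    """The 64 bits of a Python float, read as an unsigned integer."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits


def fields(x):
    """Sign bit, excess-1023 exponent and 52-bit stored mantissa of x."""
    bits = bit_pattern(x)
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    return sign, exponent, mantissa
```

For non-negative values, sorting by `bit_pattern` gives the same order as sorting numerically, which is the property the field layout is designed to provide, and 0.0 is the all-0 word.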


The IBM 360 uses a base of 16 and hence the normalization 
process may not produce, in the mantissa, a leading bit of 1. 
Rather, the leading four bits must contain a 1. For this 
reason, numbers whose leading hexadecimal digit is low (such 
as 1 or 2) cannot be represented very accurately (the error as 
a fraction of the number is relatively large) and hence the 
need exists on the 360, more than on most other machines, for 
double and quadruple precision. 


We will speak (loosely) of the range of REAL numbers and by 
this we will mean roughly the extremes of values the REALS can 
achieve. These can be very high, very low or very negative 
and are governed almost solely by the base and the maximum ex- 
ponent. We will speak of the precision P as meaning the binary 
precision given generally as: 


$$P = M - \log_2 B$$


where M is the size of the mantissa in bits (including in- 
visible bits) and B is the base of the exponent. Approx- 
imately, the precision is the negative log (to the base 2) of 
the relative error of a number due to the finite resolution of 
the representation. 


It should be noted that integers up to 2**M, or so, can be 
represented exactly as REALs and that operations such as plus, 
minus and multiply are exact provided no intermediate results 
exceed this limit. 
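
The limit is easy to observe. The Python sketch below assumes the IEEE 754 double format of Python floats, for which M = 53 (52 stored bits plus one invisible bit); `exact_as_real` is a hypothetical helper name.

```python
import sys

M = sys.float_info.mant_dig      # mantissa size in bits, invisible bit included


def exact_as_real(n):
    """True if the integer n survives conversion to REAL and back."""
    return int(float(n)) == n
```

Every integer up to 2**M converts exactly, 2**M + 1 does not, and sums of integers that stay below the limit are exact.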


The rules governing mixed expressions in SNOBOL4 are similar
to those in Fortran. If the two operands of a binary
arithmetic operator (other than **) or a binary comparator (GE,
EQ, etc.) have different types (one INTEGER and the other REAL)
then the integer is converted to REAL before the operation
proceeds. SPITBOL contains a DREAL type (double precision)
and if one of the arguments to such an operation is DREAL then
the other is converted if necessary to DREAL.


One important difference with Fortran (or PL/I for that mat- 
ter) is that the types are not declared but are contained as 
part of the value. This means that it is possible to write a 
routine which can accept either type as argument and return a 
correct result. For example, assuming we wish to write a 
routine RECIP(X) which will return the reciprocal of the num- 
ber X, we can simply write: 


RECIP     RECIP  =  1.0 / X                           :(RETURN)

This routine will operate correctly whether the argument is
INTEGER, REAL, or DREAL.


Programs 15.4 & 15.5  FLOOR & CEIL

FLOOR(X) is defined as the largest integer not greater than X.
CEIL(X) is the smallest integer not less than X. They are both
related (nonlinearly) to the integer conversion facility which
truncates toward zero.


          DEXP('CEIL(X) = -FLOOR(-X)')
          DEFINE('FLOOR(X)')                          :(FLOOR_END)
FLOOR     FLOOR  =  CONVERT(X,'INTEGER')
          GE(X,0)                                     :S(RETURN)
          FLOOR  =  NE(X,FLOOR)  FLOOR - 1            :(RETURN)
FLOOR_END


Names referenced   Name    Type       Where defined
by FLOOR & CEIL:   DEXP    Function   Program 14.1

Epilogue

FLOOR and CEIL, in addition to illustrating how
CONVERT(,'INTEGER') behaves, are of interest in their own
right. Below, let N be an integer and let X be a real. Then:


          N >= CEIL(X)    <==>   N >= X
          N <  CEIL(X)    <==>   N <  X            (15.7)
          N <= FLOOR(X)   <==>   N <= X
          N >  FLOOR(X)   <==>   N >  X

These identities can be used to solve some interesting integer 
inequalities in a straightforward fashion. (See Exercise 
15.9.) 
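
The definitions and the identities (15.7) can be checked numerically. The Python sketch below mirrors the SNOBOL4 logic, with Python's `int()` standing in for CONVERT(X,'INTEGER') since both truncate toward zero; the function names are chosen for the example.

```python
def floor(x):
    """Largest integer not greater than x, built on truncation."""
    n = int(x)                   # truncates toward zero
    if x >= 0:
        return n
    return n - 1 if x != n else n


def ceil(x):
    """Smallest integer not less than x."""
    return -floor(-x)


def identities_hold(n, x):
    """The four relations of (15.7) for integer n and real x."""
    return ((n >= ceil(x)) == (n >= x) and
            (n < ceil(x)) == (n < x) and
            (n <= floor(x)) == (n <= x) and
            (n > floor(x)) == (n > x))
```

The adjustment is needed only for negative non-integers, which is where truncation and FLOOR disagree.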
Transcendental Functions

A transcendental function is one that cannot be written
(finitely) using the four fundamental operations of addition,
subtraction, multiplication and division. Examples include the
sine and other trigonometric functions, logarithms, etc. These
may be represented as an infinite series (power series, Taylor
series) of terms involving X**n where n = 0, 1, 2, ... and X
is the argument. This represents a readily available
computational method which is often the best technique if the
precision of the machine is unknown; i.e. if the computation
is to be machine-independent or if it is to be equally valid
for single and double precision.
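
As an illustration of a series evaluation that makes no assumption about precision, the following Python sketch sums the Taylor series for the sine until a further term no longer changes the partial sum. The function name `sine` and the stopping rule are choices made for this example, not the method of any program in this chapter. For large arguments the early terms grow before they shrink, so accuracy suffers unless the argument is first reduced.

```python
def sine(x):
    """sin(x) from its Taylor series x - x**3/3! + x**5/5! - ...,
    stopping when the next term no longer alters the sum."""
    total = 0.0
    term = x            # current term of the series
    n = 1               # exponent of the current term
    while total + term != total:
        total += term
        n += 2
        term = -term * x * x / ((n - 1) * n)
    return total
```

Because the loop is governed by the arithmetic itself, the same code would carry more terms on a machine with a longer mantissa.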


Where the precision is known, a much more efficient technique 
is the so-called Chebyshev interpolation method. Since most 
libraries are written for a specific machine, this method is 
widely used and a little knowledge is helpful if only for the 
purpose of pirating existing code. Let us assume that we wish 
to approximate the function f(x) with an nth degree polynomial 
p(x) and, moreover, suppose that we wish p(x) to be the best 
such approximation in the so-called mini-max sense. That is, 
the maximum deviation from f(x) in some fixed range should be 
a minimum for all polynomials of that degree. We can im- 
mediately deduce a property that p(x) must have. Suppose some 
polynomial q(x) existed which had the same degree as p(x) and 
had the same lead coefficent of x**n and was such that the er- 
ror of this approximation, f(x) - q(x), varied from a maximum 
of +M to a minimum of -M back to +M, to -M, etc. Suppose that 
there are exactly n*1 such maxima. Such polynomials can always 
be constructed, as we will see. Now suppose that q(x) is not 
as good an approximation as p(x). Then each of the local max- 
ima are greater deviations than the largest deviation of f(x) 
- p(x). That means that 


(f(x) - p(x)) - (£(x) - q(x)) = g(x) - p(x) 


must oscillate back and forth across the abscissa; this means 
that there are n solutions to an (n-1) degree equation. This 
is impossible and hence we conclude that q(x) had to be at 
least as good in the mini-max sense as p(x). This is quite 
startling in view of the fact that no assumptions at all about 
the magnitude of M were made. Polynomials which oscillate 
about the axis n times over a given interval are derived from 
the oscillatory nature of the sine wave and are known as 
Chebyshev polynomials. We have no time or space to pursue this 
fascinating topic in greater detail but we may recommend Fox 
and Parker [1968] or Hastings [1955] for further reading.


The result of a Chebyshev approximation is a polynomial of the
form


C0 + C1*X + C2*X**2 + ... + Cn*X**n                    (15.8)


which is actually computed as:


C0 + X*(C1 + X*(C2 + ... + X*Cn) ... )


to minimize operations.
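
The nested evaluation can be tried out with a short sketch. It is given in Python rather than SNOBOL4 merely so that it can be run as is. The name horner and the sample coefficients are assumptions made for illustration; they are not the coefficients of any actual Chebyshev fit.

```python
def horner(coeffs, x):
    """Evaluate C0 + C1*x + ... + Cn*x**n, where coeffs = [C0, C1, ..., Cn]."""
    result = 0.0
    # Work from Cn down to C0: one multiply and one add per coefficient.
    for c in reversed(coeffs):
        result = result * x + c
    return result


# Arbitrary illustrative coefficients (hypothetical).
coeffs = [1.0, -2.0, 0.5, 3.0]
x = 1.5
term_by_term = sum(c * x ** k for k, c in enumerate(coeffs))
print(horner(coeffs, x), term_by_term)
```

A polynomial of degree n costs n multiplications and n additions this way, against about 2n multiplications when each power of X is built up separately.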


It is interesting to note that approximations of this kind can
be found by an adaptive process in which successive approxima-
tions converge to the desired polynomial. Fox and Parker
[1968, p. 74] describe such a procedure originally due to
Novodvorskii and Pinsker. Hence it would be possible to write
a SNOBOL4 program to produce coefficients automatically for any
given function, range and desired accuracy.


For a known function and a fixed precision, the Chebyshev
interpolation coefficients can usually be looked up. Hastings
[1955] is an excellent source. If unavailable, Handbook [NBS]
should be adequate. For any specific machine, there has
probably been some work done towards constructing a
mathematical library, and such sources, if they exist, can
often provide routines carefully tailored for a specific
environment. One excellent source for the IBM 360 is IBM [360f].


The functions to follow are machine independent programs for 
computing many of the common transcendental functions. The 
results returned should be as precise as the arguments given,
with the exception that DREAL precision in some cases may not 
obtain merely because one or more internal constants have less 
than DREAL precision. This difficulty is easily overcome and 
some exercises explore such modifications. 


One problem that arises in writing machine-independent
algorithms is determining the proper accuracy. For example,
suppose we wish to compute the sum of the series:


1 + X + X**2 + X**3 + ...                              (15.9)


where 0 < X < 1/2. Ignore for the moment that the sum of the
series is 1/(1-X) and suppose that we wish to calculate the
same result in brute force fashion. How do we know when to
stop adding new terms? We might think of setting up a
PRECISION variable (adjusted for each machine) such that when 
the terms of the series fall below the quantity PRECISION * 
SUM, where SUM is the partial sum so far computed, we quit. 
This method has the disadvantage of being machine-dependent 
and does not give double precision results if X is DREAL. 
Hence we will avoid this method and employ a scheme to let the 
machine tell us when to quit. This will have the happy 
property of adapting to any machine and any precision. Our 
test is, in effect: 


EQ(SUM, SUM + X ** n)


which means that in order to add X**n to our number we have to
shift it so far to the right that all its '1' bits are lost.
This is implemented by saving the old value of SUM in a
temporary (T) and comparing, updating and branching all in the
same statement at the base of the loop. The following
statements compute the SUM of (15.9) according to this method.


          T = 0
          SUM = 1.0
          TERM = 1.0
LOOP      TERM = TERM * X
          SUM = SUM + TERM
          T = NE(SUM,T) SUM                          :S(LOOP)
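
The same stopping rule can be watched at work in a short sketch. It is written in Python rather than SNOBOL4 only so that it can be run directly; the name geometric_sum and the extra counter steps are our own additions for illustration. The sum here starts at 1.0 so that the result is 1/(1-X).

```python
def geometric_sum(x):
    """Sum 1 + x + x**2 + ... and let the arithmetic decide when to stop."""
    total = 1.0
    term = 1.0
    previous = 0.0          # plays the role of the temporary T
    steps = 0
    while total != previous:
        previous = total
        term *= x
        total += term
        steps += 1
    return total, term, steps


total, term, steps = geometric_sum(0.25)
print(total, 1.0 / (1.0 - 0.25))   # the two values should agree closely
print(steps, term)                 # term is tiny but still well above zero
```

The loop quits as soon as adding a term leaves the sum unchanged, whatever the precision of the machine happens to be. Note that the last term is still far from zero when this happens.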


The reader is cautioned that this stopping test is not equiva- 
lent to: 


EQ (TERM, 0) 


If continually multiplied by X, TERM will ultimately become 0
(or raise machine underflow which many SNOBOL4's regard as an
error) but not before it falls below the range of small
numbers (a typical value is 2**-128) whereas to be negligible in
the computation it need merely be below X * 2**-25 or so. Hence,
even if underflow were not raised, the test would be quite
inefficient.


Program 15.6: SQRT

SQRT(Y) will return the square root of the REAL number Y. The
returned precision will equal the precision of Y. The algorithm
used is an excellent example of Newton's Method for solving
implicit equations, which goes as follows. Suppose we wish to
solve the equation:


f(x) = 0 


for x, and suppose further that, given x, we can compute f(x)
and the derivative f'(x). Starting with an estimate, x1, for
x, we can compute f(x1). Since this is supposed to be zero,
we can estimate how far we are off by dividing this number by
the slope f'(x1). We can then modify x1 to obtain a new, and
closer, estimate x2 according to the formula:


x2 = x1 - f(x1) / f'(x1)


With the new estimate, a new error and slope are calculated 
and the process is repeated until the desired accuracy is ob- 
tained. In many cases, the computation converges rapidly to a 
correct solution. The rate of convergence and the question of 
convergence are decided by algebra for any particular case. 
To determine if the desired accuracy has been reached, we will
wait until


xn - f(xn) / f'(xn) = xn


As previously stated, this will adapt to any machine and any 
argument. 


To obtain the initial estimate, x1, we draw a line tangent to
the curve x = y**2 at the point (1,1). This line, y = (x+1)/2,
yields an estimate of the square root which is good for x
close to 1, but quite poor for very large or very small values
of x. While Newton's method will eventually converge on the
correct value, the error is reduced by only a factor of 2 for
large errors; this contrasts with a factor of 2/e for small
errors (See Exercise 15.11). Hence, for efficiency purposes,
the numbers are brought into an acceptable range by (a)
inverting, (b) dividing by 4096, and (c) dividing by 16. Powers
of two are used for range reduction, as opposed to powers of 10,
as these operations can be done exactly on a binary machine.
On the IBM 360/370, the exponent is a power of 16 (for this
reason, it is sometimes regarded as a hexadecimal machine) and
hence, powers of 16 are used where possible.


          DEFINE('SQRT(Y)T,ERR,SLOPE')               :(SQRT_END)
*
*  Entry point: Range reduction and initialization.
*
SQRT      LT(Y,0)                                    :S(FRETURN)
          EQ(Y,0)                                    :S(RETURN)
          SQRT = LT(Y,0.05) 1. / SQRT(1. / Y)        :S(RETURN)
          SQRT = GT(Y,4096) SQRT(Y / 4096.) * 64.    :S(RETURN)
          SQRT = GT(Y,16) SQRT(Y / 16.) * 4.         :S(RETURN)
          SQRT = (Y + 1.) / 2.
          T = SQRT
*
*  Successively increase the precision of our estimate.
*
SQRT_1    ERR = SQRT * SQRT - Y
          SLOPE = 2. * SQRT
          SQRT = SQRT - (ERR / SLOPE)
          T = LT(SQRT,T) SQRT                        :S(SQRT_1)F(RETURN)
SQRT_END
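
For readers who wish to experiment, the range reduction and Newton iteration can be sketched in Python (used here instead of SNOBOL4 only so that the fragment can be run as is). The name newton_sqrt is our own, and raising ValueError stands in for the failure return of the SNOBOL4 function; both are assumptions of this sketch.

```python
import math


def newton_sqrt(y):
    """Square root by range reduction followed by Newton's method."""
    if y < 0:
        raise ValueError("negative argument")    # stands in for FRETURN
    if y == 0:
        return 0.0
    if y < 0.05:
        return 1.0 / newton_sqrt(1.0 / y)        # (a) invert
    if y > 4096:
        return newton_sqrt(y / 4096.0) * 64.0    # (b) divide by 4096
    if y > 16:
        return newton_sqrt(y / 16.0) * 4.0       # (c) divide by 16
    estimate = (y + 1.0) / 2.0                   # tangent-line first guess
    previous = estimate
    while True:
        err = estimate * estimate - y
        slope = 2.0 * estimate
        estimate = estimate - err / slope
        if estimate < previous:                  # still improving
            previous = estimate
        else:
            return estimate


for value in (2.0, 0.5, 1e-6, 1e10):
    print(value, newton_sqrt(value), math.sqrt(value))
```

The loop stops when an iteration fails to lower the estimate, so no precision constant is needed.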
Epilogue


The speed of SQRT can be increased (by about 30%) by an al-
gebraic condensation of the inner loop. This is left as an
exercise.


Program 15.7: TRIG

By elementary trigonometry, if we can obtain any one of the
six trigonometric functions, viz. sine, cosine, tangent,
cotangent, secant or cosecant, we can obtain them all.


Cotangent, secant and cosecant are merely reciprocals of tan- 
gent, cosine and sine respectively and are therefore not 
represented as functions here. Tangent and cosine are given 
in terms of the sine. 


The algorithm for sine is from Beeler, et al [1972, p. 75] and
relies on the following trigonometric identity:


sin A = 3 sin(A/3) - 4 sin(A/3)**3


The identity is normally given as sin 3A and we speak of
'triple-angle' formulas. Collections of such identities are
available in many handbooks such as Handbook [CR] and Handbook
[NBS]. This formula is a recursive formula for obtaining the
sine of an angle in terms of a smaller angle. If the angle
ever becomes small enough we can say its sine equals the angle
itself (the angle is presumed to be given in radians and we
assume the reader knows that one radian is 57.3 or 180/PI
degrees). Again, the issue of when to terminate arises and
this is done when subtracting off 4*sin(A/3)**3 does not modify
3*sin(A/3). But this test must be made before sin(A/3) is
called or else we will have an infinite recursive plunge. Hence
we do the test on A/3. If equality obtains for A/3 it must also
obtain for the slightly smaller value sin(A/3). Thus the
algorithm terminates when 4*(A/3)**3 is insignificant compared
with 3*(A/3), or, equivalently, when 4*A**2 is insignificant
compared with 27. With 25 bits of precision, for example, this
happens if A is 2**-12 or so. Since A decreases by thirds, we
will require eight recursive calls or so before the function is
evaluated. This will depend somewhat on the original argument.
By using other identities, the amount of recursion required
can be considerably reduced. See Exercise 15.12.


          DEFINE('SIN(A)K')
          DEFINE('SIN.(A)')
          PI. = 3.14159265358979                     :(SIN_END)
*
*  Entry point: reduce range to [0, 2 PI.)
*
SIN       SIN = LT(A,0) -SIN(-A)                     :S(RETURN)
          SIN = LT(A,2 * PI.) SIN.(A)                :S(RETURN)
          K = CONVERT(A / (2 * PI.),'INTEGER')
          SIN = SIN.(A - K * 2 * PI.)                :(RETURN)
*
*  Test and return or plunge recursively and adjust.
*
SIN.      SIN. = EQ(27.,27. - 4 * A * A) A           :S(RETURN)
          A = SIN.(A / 3.)
          SIN. = A * (3 - 4 * A * A)                 :(RETURN)
SIN_END
*
*  Standard identities yield other trigonometric functions.
*
          DEXP('COS(A) = SQRT(1 - SIN(A) ** 2)')
          DEXP('TAN(A) = SIN(A) / COS(A)')


Names referenced     Name      Type        Where defined
by TRIG:             SQRT      Function    Program 15.6
                     DEXP      Function    Program 14.1

Epilogue


The reason for the separate recursive routine (SIN.) is to
save time (no need for range checking after it's done
originally) and space on the recursive stack (no need to
continually push K).
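
The recursive plunge is easy to try out. The following is a sketch in Python (chosen over SNOBOL4 only so that it runs as is). The names sine and sin_small are our own; sin_small plays the part of SIN. and the library value math.pi stands in for the constant PI.

```python
import math


def sin_small(a):
    """Triple-angle recursion for an angle already reduced to [0, 2*pi)."""
    # When 4*a*a is lost against 27, the sine equals the angle itself.
    if 27.0 == 27.0 - 4.0 * a * a:
        return a
    s = sin_small(a / 3.0)
    return s * (3.0 - 4.0 * s * s)


def sine(a):
    """Sine of a (radians): reduce the range once, then recurse."""
    if a < 0:
        return -sine(-a)
    two_pi = 2.0 * math.pi
    if a >= two_pi:
        a -= int(a / two_pi) * two_pi
    return sin_small(a)


for angle in (0.5, 1.0, 3.0, 10.0):
    print(angle, sine(angle), math.sin(angle))
```

With double precision the recursion runs deeper than the eight calls or so quoted for 25 bits, since the angle must become much smaller before the test succeeds.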


Program 15.8: ARC

The functions ASIN(X), ACOS(X) and ATAN(X) will return
respectively the arc sine, arc cosine and arc tangent in
radians. As was the case with the trig functions, a nonobvious
computation is required for one of the functions, and
standard trig identities produce the other two. Since we
already have sine and cosine we could use Newton's method to
compute the arcs. Alternatively, we could invert the recursive
procedure used to compute the sine. For variety, however, we
will leave these options as exercises and consider yet another
method for producing a machine-independent computation of the
arcs.


A power series expansion for arc sine X is [Handbook, NBS, p.
81]:


X + X**3/(2*3) + 1*3*X**5/(2*4*5) + 1*3*5*X**7/(2*4*6*7) + ...   (15.10)


While this series converges for all |X| < 1, convergence is 
slow if X is near one. For X < 0.5, however, the convergence 
rate is quite acceptable requiring at most about P/2 terms 
where P is the precision in bits. 


A power series expansion for arc cos(1-Z) [Handbook, NBS, p.
81] is


SQRT(2*Z) * (1 + Z/(4*3) + 1*3*Z**2/(4**2*5*2!)
               + 1*3*5*Z**3/(4**3*7*3!) + ...)


This series converges more rapidly in the worst case than the
previous one. It makes use of the fact that the parabolic
curve of the general form y = x**2 is a close fit to the bend in
the sine curve. The power series expansion is actually for
the deviation between the two. After range reduction, the
worst case value is Z = 1 and convergence may be expected in
about P - log2(P) steps. Hence, we will define the arcs in terms
of the power series for arc cosine.


The two methods actually complement each other and together 
can provide a method of keeping the number of iterations below 
P/2. This is left as Exercise 15.16.


          DEFINE('ACOS(X)K,TERM,T')
          PI. = 3.14159265358979                     :(ACOS_END)
*
*  Entry point: Reduce the range to consider only quantities
*  in the first quadrant.
*
ACOS      ACOS = LT(X,0) PI. - ACOS(-X)              :S(RETURN)
*
*  Initialize for the loop starting with label ACOS_1. This
*  is a power series for arc cosine.
*
          ACOS = 1.0
          TERM = 1.0
          X = 1.0 - X
          K = 1
ACOS_1    TERM = TERM * (2 * K - 1) * X / (4 * K)
          ACOS = ACOS + TERM / (2 * K + 1)
          K = K + 1
          T = NE(ACOS,T) ACOS                        :S(ACOS_1)
          ACOS = SQRT(2 * X) * ACOS                  :(RETURN)
ACOS_END
*
*  Arc sine and arc tangent are defined in terms of arc
*  cosine.
*
          DEXP('ASIN(X) = (PI. / 2) - ACOS(X)')
          DEXP('ATAN(X) = ACOS(1. / SQRT(1 + X * X))')


Names referenced     Name      Type        Where defined
by ARC:              SQRT      Function    Program 15.6
                     DEXP      Function    Program 14.1
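
The arc cosine series may be checked against a library routine with the following sketch, given in Python rather than SNOBOL4 so that it can be run directly. The names arc_cosine and arc_sine are our own, and math.sqrt and math.pi stand in for SQRT and PI.; these are assumptions of the sketch.

```python
import math


def arc_cosine(x):
    """Arc cosine of x (-1 <= x <= 1) from the series in Z = 1 - x."""
    if x < 0:
        return math.pi - arc_cosine(-x)      # fold into the first quadrant
    z = 1.0 - x
    total = 1.0
    term = 1.0
    previous = 0.0
    k = 1
    while total != previous:                 # stop when the sum stops moving
        previous = total
        term = term * (2 * k - 1) * z / (4 * k)
        total = total + term / (2 * k + 1)
        k += 1
    return math.sqrt(2.0 * z) * total


def arc_sine(x):
    """Arc sine from the identity asin(x) = pi/2 - acos(x)."""
    return math.pi / 2.0 - arc_cosine(x)


for value in (0.0, 0.3, 0.9, -0.5):
    print(value, arc_cosine(value), math.acos(value))
```

At the worst case Z = 1 each term is about half the size of the one before it, so roughly one bit of the result is settled per pass through the loop.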

Program 15.9: LOG

LOG(X,B) will return the log of X to the base B. If B is null
(or absent), the natural log is returned. Given a method of
obtaining logs to some base B, one can obtain a log to an
arbitrary base B1 by the identity:


LOG(X,B1) = LOG(X,B) / LOG(B1,B)


and so the problem reduces to finding logs to some base B.




If one were coding in assembly language, a natural choice on a 
binary machine would be base 2. This is because the exponent 
part of the real number is the integer part (actually the 
floor plus one) of the logarithm and is available with no com-
putation. Moreover, the fractional part of the logarithm can
also be plucked out of the exponent after successive squarings 
of the mantissa in a method described by Gosper in Beeler 
[1972, p.76]. 


Unfortunately, SNOBOL4 cannot generally 'get at' the exponent
of a floating point number (except for SITBOL). An integer
approximation to the base 10 logarithm can be found by
counting the number of characters in a string representation of
the number. Thus SIZE(CONVERT(X,'INTEGER')) returns the
ceiling of LOG10 X. If X is larger than the largest integer,
however, it must be divided down. One can translate Gosper's
method to operate on a decimal machine (which is what we have
at this point) by raising the remainder to the 10th power for
each succeeding digit. This is the method actually used.


*
*  LOG(X,B) will return the logarithm of X to the base B.
*  LOG(X) will return the natural logarithm of X.
*
          LN_10 = 2.3025850929940456840
          DEXP('LOG(X,B) = NE(B,0) CLOG(X) / CLOG(B) ;'
+              'LOG = EQ(B,0) CLOG(X) * LN_10 ;')
*
*  CLOG will return the common log (base 10) of X.
*
          DEFINE('CLOG(X)FACTOR,T,K')                :(CLOG_END)
*
*  Entry point: FACTOR is initialized to 1.0 with a precision
*  equal to the precision of the argument X. Here we handle
*  fractional cases (negative logs) in the event that either
*  the original number was below 1.0 or the number X goes
*  fractional as a result of the division at CLOG_4.
*
CLOG      FACTOR = X / X
CLOG_1    X = LT(X,1) 1 / X                          :F(CLOG_2)
          FACTOR = -FACTOR
*
*  Here's the main loop. We determine the number of digits
*  (minus one) to the left of the decimal (K), which we may
*  regard as a crude approximation of the log. Reduce the
*  log of X by this much by dividing by 10 ** K. Then find
*  the log of this reduced quantity.
*
CLOG_2    EQ(X,1.0)                                  :S(RETURN)
          K = SIZE(CONVERT(X,'INTEGER')) - 1         :F(CLOG_4)
          EQ(K,0)                                    :S(CLOG_3)
          CLOG = CLOG + K * FACTOR
          T = NE(CLOG,T) CLOG                        :F(RETURN)
          X = X / 10. ** K
CLOG_3    FACTOR = FACTOR / 10.
          X = X ** 10                                :(CLOG_1)
*
*  If X is larger than the largest integer, we come here.
*
CLOG_4    K = 10
          X = X / 10. ** K
          CLOG = CLOG + K * FACTOR                   :(CLOG_2)
CLOG_END


Names referenced     Name      Type        Where defined
by LOG:              DEXP      Function    Program 14.1

Epilogue


Since the characteristic of a number to the base 10 can be ob-
tained by inspection, the method above is suitable for com-
puting logarithms on the four-function desk calculator. The
reader is invited to try a few examples for himself.
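
In place of the desk calculator, the digit-at-a-time method can be tried with the sketch below. It is written in Python rather than SNOBOL4 so that it runs as is, and the name common_log is our own. Since Python integers do not overflow, the sketch assumes that the branch for arguments larger than the largest integer is not needed and omits it.

```python
import math


def common_log(x):
    """Base-10 logarithm of x > 0, found one decimal digit per pass."""
    if x <= 0:
        raise ValueError("argument must be positive")
    result = 0.0
    factor = 1.0            # weight of the digit currently being found
    previous = None
    while True:
        if x < 1.0:         # negative logs: invert and flip the sign
            x = 1.0 / x
            factor = -factor
        if x == 1.0:
            return result
        k = len(str(int(x))) - 1    # digits left of the point, minus one
        if k != 0:
            result += k * factor
            if result == previous:  # the new digit no longer registers
                return result
            previous = result
            x = x / 10.0 ** k
        factor /= 10.0
        x = x ** 10         # shift the next digit of the log into view


for value in (2.0, 0.5, 1000.0, 12345.678):
    print(value, common_log(value), math.log10(value))
```

Each pass raises the reduced number to the 10th power, which multiplies its logarithm by ten and so exposes the next decimal digit as a digit count.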

Another method for computing log is the power series:

ln(1+x) = x - x**2/2 + x**3/3 - x**4/4 + ...               (15.11)

To use this power series one must reduce large x until it
comes close to 0. This can be done in part by the SIZE method.
To bring x yet closer to 0, the identity:

LOG(X) = 2 * LOG(SQRT(X))


can be used.


Program 15.10: RAISE

RAISE(X,Y) will raise X to the power Y. This function is
entirely redundant if the second operand of the ** operator is
permitted to be REAL. It is not in many versions of the
language and so RAISE must be included in our set. Indeed,
its presence may suggest alternative methods for computing
some of our functions (certainly SQRT).


If one can raise some number, Z, to an arbitrary power, one 
can then define RAISE(X,Y) as: 


RAISE(Z, LOG(X,Z) * Y)

The number we will choose as Z is the base of the natural logs 
(normally designated e) and a special function EXP(X) will 
return e raised to the Xth power; EXP is normally called the 
exponential function. 


EXP(X) can be written as a Taylor series:


1 + X + X**2/2! + X**3/3! + ...


which converges rapidly for X < 1. For X > 1, we simply obtain
the integer part (the floor) I and use the rule:


EXP(X) = EXP(X - I) * e ** I


          DEXP('RAISE(X,Y) = EXP(Y * LOG(X))')
          DEFINE('EXP(X)TERM,K,T')
          NAT_BASE = 2.718281828459045               :(EXP_END)
*
*  Entry point for EXP. Reduce the range to [0,1].
*
EXP       EXP = LT(X,0) 1. / EXP(-X)                 :S(RETURN)
          K = GT(X,1) CONVERT(X,'INTEGER')           :F(EXP_1)
          EXP = EXP(X - K) * NAT_BASE ** K           :(RETURN)
*
*  Initialize for the power series which is summed in the
*  loop headed by EXP_2.
*
EXP_1     TERM = 1.
EXP_2     EXP = EXP + TERM
          K = K + 1
          TERM = TERM * X / K
          T = NE(T,EXP) EXP                          :S(EXP_2)F(RETURN)
EXP_END


Names referenced     Name      Type        Where defined
by RAISE:            LOG       Function    Program 15.9
                     DEXP      Function    Program 14.1
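
The exponential routine can be exercised with the following sketch in Python (used instead of SNOBOL4 only so that it can be run as is). The names exp_series and raise_power are our own, and the library function math.log stands in for LOG of Program 15.9; these are assumptions of the sketch.

```python
import math

NAT_BASE = 2.718281828459045


def exp_series(x):
    """e**x by range reduction to [0,1] and a self-terminating series."""
    if x < 0:
        return 1.0 / exp_series(-x)
    if x > 1:
        k = int(x)                       # integer part of the argument
        return exp_series(x - k) * NAT_BASE ** k
    total = 0.0
    term = 1.0
    previous = None
    k = 0
    while total != previous:             # stop when the sum stops moving
        previous = total
        total += term
        k += 1
        term = term * x / k              # next term: x**k / k!
    return total


def raise_power(x, y):
    """x**y for x > 0, with math.log standing in for LOG."""
    return exp_series(y * math.log(x))


for value in (0.5, 2.5, -3.2):
    print(value, exp_series(value), math.exp(value))
print(raise_power(2.0, 0.5), math.sqrt(2.0))
```

Note that the term counter must be advanced before each division, so that successive terms are X, X**2/2!, X**3/3! and so on.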

EXERCISES


Exercise 15.1  Rewrite COMB (Prog. 15.1) so that it computes
iteratively. Do not separately compute numerator and
denominator as this may result in an unnecessary overflow.
Also do not divide numbers that are not divisible.


Exercise 15.2  A rather unusual method for computing some
combinatorial functions was shown to the author by Dennis
Allen. It uses pattern matching to count combinations. The
pattern matcher will undergo a number of attempts to match and
this can be used (in fullscan mode) to compute (however
inefficiently) some combinatorial functions. For example, let
INC(.N) increment the variable N by 1. Then,


          &FULLSCAN = 1
          N = 0
          S LEN(1) *INC(.N) FAIL


will count the number of characters in the string S. Rewrite
COMB(N,M) so that it computes the function this way.


Exercise 15.3  What is the maximum number representable in
the combinatorial number system with nome N where
SIZE(COMB_ALPHA) = L?


Exercise 15.4  Write a function COMBDE(K,N) which converts
integer K into a representation in the combinatorial number
system with nome N. If there are insufficient characters in
COMB_ALPHA, COMBDE should fail.


Exercise 15.5  Since SPITBOL does not allow redefinition of
operators, the INFINIP package (Prog. 15.3) must be modified
to run under that processor. (a) What, for example, would DIFF
look like under such a modification? (b) How many statements
in DIV would require modification?


Exercise 15.6  Augment the INFINIP package by adding the **
operator. Do not multiply out the indicated number of times
but use the rule:


X ** N = (X ** (N / 2)) ** 2 * X ** REMDR(N,2)         (15.12)

Exercise 15.7  In the DIV procedure of the INFINIP package, a
better estimate of the trial quotient can be obtained by
making the first digit of Y higher (better to be 9 than 1).
This can be done by multiplying both X and Y by the same
quantity. See Knuth [Vol. 2, p. 235]. Implement a scheme to
make sure that the first digit of Y is at least 5 (requires
only one additional statement if SUBSTR (Prog. 3.9) is used).


Exercise 15.8  Write a function ROUND(X) which will return
the nearest integer to X (on ties, pick either). This requires
three statements.


Exercise 15.9  Let X, Y and Z be positive real numbers. For
what values of X will

FLOOR(Y / X) < Z ?

Using the relationships in (15.7) and the fact that

N > M <==> N >= M + 1

for all integer N and M, give a step-by-step proof of your
answer.


Exercise 15.10  To improve the speed of SQRT (Prog. 15.6),
replace the three statements at label SQRT_1 by one.


Exercise 15.11  Let e represent the error of an approximation
a to the square root of the quantity x**2. That is


e = a - x

One iteration of Newton's method produces a new error. (a)
Derive a formula which yields the new error E in terms of the
old error e. (b) Assuming an initial error of 0.1, how many
iterations will produce an error less than 10**-20?


Exercise 15.12  Given the formula for sine 3A, deduce a
formula for sine 9A. Recode the SIN routine of TRIG (Prog.
15.7) accordingly. Can the same stopping criterion be used?


Exercise 15.13  If the second statement of SIN.() had been:


          A = SIN.(A / 3)


a bug would have been introduced. For which values of argument
A would SIN(A) then yield an incorrect value?


Exercise 15.14  Compute ASIN(X) using SIN(A), COS(A) and
Newton's method in a manner similar to SQRT. Use X as the
original estimate of ASIN(X).


Exercise 15.15  To express arc sine recursively, one may use a
half-angle (or fractional angle) formula in order to reduce
the range. One such is:


SIN(A / 2) = SQRT((1 - SQRT(1 - SIN(A) ** 2)) / 2)


(a) Express ASIN(X) in terms of ASIN(X / 2). (b) If one were
to use the recursive formula to implement ASIN(X), what
stopping criterion would one use?


Exercise 15.16  Using the power series of (15.10), modify
ACOS as suggested in the text.


Exercise 15.17  In LOG (Prog. 15.9) we depend on being able to
convert REALS to INTEGERS for all reals in the range (0, M).
That is, we suppose that the maximum integer is greater than
M. What is M? (Hint: the answer is not 10**10.)


Exercise 15.18  It is not strictly necessary to insert
numeric constants into the programs TRIG, ARC, LOG and RAISE.
Rather, they may be computed by appropriate calls on the
defined routines. Modify the routines so that they compute the
constants.


Exercise 15.19  Assume you are writing an assembler and must
construct a real number in its machine form for a binary
machine with 27 bits of precision. Given other functions in
the book (Chapter two), this reduces to the following problem:
given a non-zero real number X, find the exponent N and
integer I such that 2**26 <= I < 2**27 and


X = (approx.) 2**N * I / 2**27


Using LOG (Prog. 15.9), N and I can be computed in three
statements. What are they?


Exercise 15.20  In order to make the random number generator
(RANDOM, Prog. 16.1) go backward, we need to be able to find
the inverse of a multiplier. That is, we need to solve for X
in:


X*R = 1 (Mod M)


This can be done by noting that:


X = R**(M-2) (Mod M)


Assuming that M-2 multiplications may be too time-consuming,
work out a method whereby only 2*log2(M-2) multiplications are
required.


Exercise 15.21  If RAISE (Prog. 15.10) is used in SPITBOL and
if a DREAL argument is given to the function EXP, the returned
value will be DREAL but will not have DREAL accuracy. Why? How
can one correct this deficiency and still return a
single-precision result if a REAL is given as argument? (Hint:
the answer requires modifying one statement.)


Stochastic or random strings have many applications
within the computing sphere of activity. Some exotic
uses include poetry, choreography, play and brand-name
generation, cryptographic and linguistic analysis, and
even police-patrol scheduling [Aberg 1974]. Simulations
and game-playing also make critical use of the computer's
ability to generate near random sequences. More mundane ap-
plications include algorithm testing and timing.


Digital computers have the power to produce prodigious quan- 
tities of what appear to be random strings and/or random 
numbers. However, if pressed to define precisely what is meant 
by the term 'random' one must be careful. For example, Table
16.1 contains two groups of 'random' English words. One group 
was formed by selecting words at random from a novel. The 
other group was formed by selecting dictionary entries at ran- 
dom. It should be immediately evident which source produced 
which group. Yet both groups have at least some claim to being 
called 'random English words'. 


Table 16.1  One of the groups of words shown below was
obtained by randomly selecting from entries in a dictionary
and the other by selecting words from a novel. Is it obvious
which is which?

     Source A          Source B
     --------          --------
     your              dialectician
     a                 Jemappes
     the               profligate
     and               disenfranchise
     Hell              opaque


To make the notion of randomness more precise we speak of a
sample space containing a possibly infinite collection of
things. A random selection is a selection of a single item
from the sample space with the proviso that all items have an
equal chance for selection. In the example above, one sample
space was the set of dictionary entries which approximates the
set of distinct words of the English language. The other
sample space was the set of words in a novel which approximates
the totality of all words actually used to communicate thought
using the English language. Note that a sample space may have
repeated items such as the novel or they may all be distinct
as in the dictionary case. Note too that a sample space may
be completely unstructured as in the two examples given. This
may be contrasted with a sample space obtained by five tosses
of a coin in which the sample space is a well-structured set
containing 32 combinations, each describable by a sequence of
five binary digits.


Program 16.1: RANDOM

Random strings are constructed from random numbers and so
this is what we must obtain first. RANDOM(N), where N is a
positive integer, will return a 'random' number from the
sample space {1, 2, ..., N}. For example, if RANDOM(3)
were called 10 times the sequence produced could be:


1 33 3 2 3 12 1 1 3 


If the argument N is 0, the number returned will be of type
REAL chosen from the sample space [0,1) which is the interval
on the real line from 0 (inclusive) to 1 (exclusive).* Calls
to RANDOM with different arguments may be intermixed without 
adversely affecting the generating process. 


Since the numbers are produced by a deterministic process they 
are not truly random but only apparently random. It is con- 
ventional to term such processes pseudo-random.  Pseudo-random 
sequences have the very convenient property of being 
repeatable. This can be important in debugging or in studying 
certain effects in greater detail. If one wishes to obtain a 
different sequence one can set the variable RAN_VAR to some
other value in the range {1, 2, ..., 414970}. For game
playing, it is sometimes necessary to initialize the random 
number generator to a value which is indeed unpredictable. For 
such purposes one can use the clock. 


*
*  RANDOM(N) will return an integer uniformly distributed on
*  {1, 2, ..., N}. If N = 0, it will return a real uniformly
*  distributed in the interval (0,1).
*
          DEFINE('RANDOM(N)')
          RAN_VAR = 1                                :(RANDOM_END)
*
*  The REAL is produced in any case. If an integer is wanted,
*  the REAL is multiplied by the proper range. Note that
*  CONVERT truncates rather than rounds.
*
RANDOM    RAN_VAR = REMDR(RAN_VAR * 4676,414971)
          RANDOM = RAN_VAR / 414971.
          RANDOM = NE(N,0) CONVERT(RANDOM * N,'INTEGER') + 1  :(RETURN)
RANDOM_END
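
The generator is small enough to follow by hand, and the sketch below may help. It is written in Python rather than SNOBOL4 only so that it can be run as is; the class name Lehmer and its method names are our own, with the attribute state playing the part of RAN_VAR.

```python
class Lehmer:
    """Multiplicative congruential generator with the constants of RANDOM."""

    MODULUS = 414971
    MULTIPLIER = 4676

    def __init__(self, seed=1):
        self.state = seed                 # corresponds to RAN_VAR

    def random(self, n=0):
        """Integer in 1..n, or a real strictly between 0 and 1 if n is 0."""
        self.state = (self.state * self.MULTIPLIER) % self.MODULUS
        real = self.state / self.MODULUS
        if n != 0:
            return int(real * n) + 1      # int() truncates, like CONVERT
        return real


gen = Lehmer()
print([gen.random(3) for _ in range(10)])
print(gen.random())
```

Starting two generators from the same seed gives the same sequence, which is the repeatable behavior described in the text.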


*Actually, this is a slight fiction. The number of reals 
representable by the machine is finite, whereas the number of 
reals in the interval is (uncountably) infinite. The intent 
is to approximate this interval.

Epilogue


RANDOM (N) belongs to a class of generators called the 
congruential type first proposed by Lehmer [1951]. Given some 
integer R in the range 0 < R < M where M is some integer
called the modulus, the next value of R (which we denote by
R') is obtained by the computation


R' = R*A (Mod M)

or, in SNOBOL4 notation

R' = REMDR(R * A, M)


where A is some positive integer called the multiplier. The
numbers will begin to repeat themselves after a certain period
governed by R, A and M. For example, if M=10, A=7 and R=3
(thoroughly impractical values) the sequence of R's becomes


3  1  7  9  3  1  7  9  3  ...


repeating themselves every four numbers (the period is said to 
be four). A random real number in the interval is then ob- 
tained by dividing R by M. 


The congruential method is extremely important historically 
because the operation 


R' = REMDR(R * A, M) 


can be accomplished with one multiply instruction where M is
the natural modulus of the machine (for example, on the IBM 360
the natural modulus is 2**31). Use of the natural modulus is
attractive from an efficiency standpoint but is machine
dependent and can't be used in SNOBOL4 anyway because the
computation will be regarded as an error (arithmetic overflow).


The sequence of R's will consist only of integers relatively 
prime to M. This means that a period equal to M where M is a 
natural modulus is impossible. A way around this is to use 
the so-called mixed congruential generator first proposed by 
Greenberger [1961] in which the formula 


R' = R*A + C (Mod M)


is used. For correctly chosen values of A and C, the R's will
range through every number in the set {0, 1, ..., M-1}.


Another method of obtaining long periods is to use a prime 
modulus. If M is prime, then for certain values of A the 
generator: 

R' = R*A (Mod M) 


will cause the R's to cycle through every integer in the range 
[1, 2, ..., M-1]. Such an A is called a primitive element of 
the field of integers modulo M (see, for example, Barnard and 
Child [1955], p. 438). 
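
Whether a given A is a primitive element can be tested by brute force for small M. The sketch below is in Python rather than SNOBOL4, the function name is hypothetical, and it assumes M is prime.

```python
def is_primitive_element(a, m):
    """True if R' = R*A (mod M) cycles through every integer 1..M-1 (M assumed prime)."""
    r = a % m
    steps = 1
    while r != 1:
        if r == 0 or steps >= m:     # degenerate multiplier; can never reach 1
            return False
        r = (r * a) % m
        steps += 1
    return steps == m - 1            # the cycle must contain all M-1 values


print(is_primitive_element(12, 127))   # the pair (127, 12) appears in Table 16.2
print(is_primitive_element(2, 7))      # 2 only generates 2, 4, 1 modulo 7
```

For large moduli one would factor M-1 rather than walk the whole cycle; the exhaustive walk is used here only because it mirrors the definition directly.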


The prime-primitive pair must be such that A*R never 
overflows the machine. If the maximum integer is, for example, 
2**31 - 1 (as it is for most 32-bit machines), then it will be 
sufficient that A*M < 2**31. A list of prime-primitive pairs is 
given in Table 16.2 together with an indication of the number 
of bits of arithmetic required to avoid overflow. The choice 
of prime-primitive pair for the function RANDOM was based on 
the observation that most SNOBOL4's can represent all positive 
integers below 2**31. 


Table 16.2  Prime-primitive pairs (M, P). The product M*P of 
each pair is given an upper bound in terms of a power of 2. 

  Prime     Primitive  Smallest   |   Prime     Primitive  Smallest 
  Modulus   Element    Power of 2 |   Modulus   Element    Power of 2 
  M         P          > M*P      |   M         P          > M*P 
  --------------------------------+--------------------------------- 
  127       12         2**11      |   10657     735        2**23 
  127       29         2**12      |   10657     824        2**24 
  211       35         2**13      |   4409      4035       2**25 
  211       41         2**14      |   19423     3088       2**26 
  491       59         2**15      |   10657     7367       2**27 
  491       84         2**16      |   24281     9713       2**28 
  1103      117        2**17      |   29443     13300      2**29 
  1103      156        2**18      |   39971     20411      2**30 
  1223      421        2**19      |   414971    4676       2**31 
  1987      451        2**20      |   532333    8705       2**33 
  1987      1017       2**21      |   1299709   16322      2**35 
  2741      1148       2**22      |   1798963   160658     2**39 
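
The third column of Table 16.2 is simple arithmetic and can be recomputed for any pair. A minimal Python sketch (the function name is hypothetical):

```python
def smallest_power_exceeding(m, p):
    """Smallest k such that 2**k > m*p, i.e. bits needed to hold the product."""
    return (m * p).bit_length()


for modulus, primitive in [(127, 12), (414971, 4676), (1798963, 160658)]:
    k = smallest_power_exceeding(modulus, primitive)
    print(modulus, primitive, "2**%d" % k)
```

For the pair (414971, 4676) the product is 1940404396, which lies below 2**31 and so fits the arithmetic assumed for RANDOM.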


Tests for Randomness 
One might suppose that there existed a single, simple test for 
randomness which could be applied to some pseudo-random generator 
to determine a coefficient of randomness. Unfortunately, no such 
single test exists. It is interesting to note that if one had 
a test to determine whether a sequence was truly random, that 
test could be used to produce, by elimination, a truly random 
sequence. We would then have a contradiction in terms, since 
an algorithmic process can never produce truly random numbers. 
Rather than a single, all-powerful test for randomness, there 
exist many tests, each oriented toward detecting violations of 
important characteristics of random behavior. Knuth [Vol. 2] 
and Canavos [1967] describe a number of such tests. Those 
outlined here are from Canavos and have actually been applied 
to the generators mentioned in this chapter. 

The most common test seems to be the bins test, which seeks to 
answer the most obvious question: Is each of the B integers 
from RANDOM(B) equally likely? RANDOM(B) is called succes- 
sively N times where B is the number of bins. The number of 
numbers appearing in each bin should average out to N/B. But 
the distribution over the bins cannot be expected to be per- 
fectly flat or one would suspect nonrandom behavior. One can 
measure the extent to which the distribution deviates from 
perfection and the deviation proper for a random generator is 
given by the so-called Chi-squared distribution. The number 
of bins, B, is selected so as to maximize the power of the 
test and depends upon the number of samples taken. For exam- 
ple, for N = 1000, the number of bins suggested is 50. 
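
The statistic behind the bins test is easy to compute. The following Python sketch (not SNOBOL4) uses Python's own `random` module as a stand-in for RANDOM(B), which is an assumption made only so the example is self-contained; the function name is hypothetical.

```python
import random


def bins_chi_squared(samples, bins):
    """Chi-squared statistic for integer samples expected to be uniform on 1..bins."""
    counts = [0] * bins
    for value in samples:
        counts[value - 1] += 1
    expected = len(samples) / bins          # N/B, the ideal count per bin
    return sum((observed - expected) ** 2 / expected for observed in counts)


rng = random.Random(1)                      # stand-in for the generator under test
draws = [rng.randint(1, 50) for _ in range(1000)]
print(bins_chi_squared(draws, 50))
```

The resulting number is then compared with the chi-squared distribution for B - 1 degrees of freedom; a value that is either very large or very close to zero is grounds for suspicion.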


Another popular test for randomness is the correlation test 
which determines whether numbers a given fixed distance apart 
are correlated. For example, in the Canavos series, correla- 
tion is tested for distances of 1 through 8. The extent to 
which the numbers are correlated in any given sequence can be 
calculated. Random generators would tend to produce zero cor- 
relation in the long run, but in the short run they are expec- 
ted to produce a small correlation. Observed correlations 


above or below this level are suspicious. 
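
One way to compute the correlation at a given distance is the ordinary sample correlation between the sequence and a shifted copy of itself. This Python sketch is an illustration under that assumption; the exact formula used in the Canavos series is not given in the text.

```python
def lag_correlation(values, lag):
    """Sample correlation between values[i] and values[i + lag]."""
    n = len(values) - lag
    xs, ys = values[:n], values[lag:]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5     # undefined for a constant sequence


alternating = [0, 1] * 10                   # a thoroughly nonrandom sequence
print(lag_correlation(alternating, 1))      # close to -1
print(lag_correlation(alternating, 2))      # close to +1
```

A strictly alternating sequence passes the bins test perfectly yet shows extreme correlation at every distance, which is exactly the kind of defect this test is meant to expose.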


When RANDOM(2) is called repeatedly, the binary sequence 
produced can be considered to be like the head-tail sequence 
produced by flipping a coin. Questions one might ask are: Is 
heads just as likely as tails? This is answered by the bins 
test. Another question is: Will heads follow heads as often 
as it follows tails? This is answered by the correlation test. 
A classic coin-tossing question not answered by these tests is 
the following: If K heads in a row are produced, is the next 
toss more likely to be a head or a tail? One might fear that 
an artificial system of producing random numbers might be too 
'round' and not produce enough long sequences or be too 
'angular' and produce too many. Such questions are settled by 
the so-called runs test. A run is a sequence of heads bounded 
on both sides by a tail or a sequence of tails bounded by 
heads. The number of runs of length 1, 2, 3, ... is measured 
and the resulting distribution should be close to that obtained 
from a random distribution. As in the bins test, the chi-square 
formula is used to determine if the distribution is 'too good' 
or 'too bad'. 
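
Counting runs is straightforward. The Python sketch below is an illustration only; for simplicity it also counts the runs at the two ends of the sequence, which are bounded on one side only, so it departs slightly from the definition above.

```python
from collections import Counter
from itertools import groupby


def run_lengths(tosses):
    """Map each run length to the number of runs of that length."""
    return Counter(len(list(group)) for _, group in groupby(tosses))


print(dict(run_lengths("HHTHTTTH")))
```

For a fair coin tossed n times one expects roughly n / 2**(k+1) runs of length k, and it is against these expected counts that the observed counts are compared.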


Other Generators 

It is frequently useful to know of other generators so that if 
the results of one generator or type of generator become 
suspect, another may be plugged in. The following extremely 
portable generator was suggested by Kruskal [1969]. 


R' = R * 125 (Mod 2**13) 


The one multiplication by 125 can be replaced by three 
multiplications by 5 so that, provided the machine can contain 
5 * 2**13 as an integer, the computation can be done without 
overflow. Unfortunately the period is short. 
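
The equivalence of the two computations can be confirmed exhaustively, since there are only 2**13 possible values of R. A Python sketch with hypothetical function names:

```python
MODULUS = 2 ** 13


def kruskal_direct(r):
    """R' = R * 125 (mod 2**13) in one step."""
    return (r * 125) % MODULUS


def kruskal_by_fives(r):
    """Same result via three multiplications by 5, reducing after each."""
    for _ in range(3):
        r = (r * 5) % MODULUS        # r < 2**13, so r * 5 stays below 5 * 2**13
    return r


print(all(kruskal_direct(r) == kruskal_by_fives(r) for r in range(MODULUS)))
```

Reducing after every multiplication is what keeps each intermediate product below 5 * 2**13.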


Another method is to construct a random number generator 
according to a recipe suggested by Knuth [Vol. 2, pp. 155-156]. 
One such generator is: 


R' = R * 3141 + 110795 (Mod 524288 = 2**19) 


Another approach is to use a standard generator with multiple 
precision arithmetic. One generator endorsed by Coveyou and 
Macpherson [1967] (they do not endorse many) is: 


R' = R * 25214903917 (Mod 2**35 = 34359738368) 


To perform the arithmetic within SNOBOL4 on the IBM 360, three 
integers are needed to contain the multiplication. This will 
slow the computation and increase the complexity of the 
program but the random numbers should be quite random. 


Program 16.2  RAMM 

There are techniques for combining random number generators 
to produce degrees of randomness higher than either operating 
alone. One method, proposed by MacLaren and Marsaglia [1965], 
is to let one random generator shuffle the output of a second 
random generator. This is done in RAMM(N) below, which will 
behave like RANDOM(N) except that its statistics will be 
better. It uses a Knuth generator to shuffle the output of 
RANDOM. 


         DEFINE('RAMM(N)K')
*
*  The following two OPSYN's make the subroutine plug-in-able
*  to any routine already using RANDOM.
*
         OPSYN('RANDOM.','RANDOM')
         OPSYN('RANDOM','RAMM')
*
*  Initialize the RAMM array (RAMM_A) with random numbers
*  obtained from RANDOM.().
*
         I = 0
         RAMM_A = ARRAY('0:99')
RAMM_1   RAMM_A<I> = RANDOM.(0)                  :F(RAMM_END)
         I = I + 1                               :(RAMM_1)
*
*  Entry point: Select an element K of RAMM_A at random.
*  Return this value and fill up the entry with a new RANDOM
*  value.
*
RAMM     RAM_VAR = REMDR(RAM_VAR * 3141 + 110795, 524288)
         K = CONVERT((RAM_VAR / 524288.) * 100,'INTEGER')
         RAMM = RAMM_A<K>
         RAMM_A<K> = RANDOM.(0)
         RAMM = NE(N,0) CONVERT(RAMM * N,'INTEGER') + 1
+                                                :(RETURN)
RAMM_END

Names referenced    Name        Type       Where defined 
by RAMM:            RANDOM *    Function   Program 16.1 

* indicates name is referenced in the initialization section. 
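
The shuffling idea can also be sketched in Python (not SNOBOL4). This is an illustration of the MacLaren and Marsaglia scheme as described above, not a transcription of RAMM; the class and method names are hypothetical, and Python's `random` module stands in for RANDOM.

```python
import random


class ShuffledGenerator:
    """One generator selects which buffered output of another generator to release."""

    def __init__(self, source, table_size=100, selector_seed=0):
        self.source = source                      # callable returning a real in [0, 1)
        self.table = [source() for _ in range(table_size)]
        self.state = selector_seed                # state of the selecting generator

    def next_real(self):
        # Selector: R' = R * 3141 + 110795 (mod 2**19), the Knuth generator in the text.
        self.state = (self.state * 3141 + 110795) % 524288
        k = self.state * len(self.table) // 524288
        value = self.table[k]
        self.table[k] = self.source()             # refill the slot just used
        return value

    def next_int(self, n):
        """Integer in 1..n, mirroring the RANDOM(N) convention."""
        return int(self.next_real() * n) + 1


gen = ShuffledGenerator(random.Random(1).random)
print([gen.next_int(6) for _ in range(10)])
```

Every value produced by the underlying generator is still released exactly once; only the order of release is scrambled.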


Program 16.3  RPERMUTE 

A natural application of a random number generator is to 
produce random permutations. This is easy to do in SNOBOL4. 
RPERMUTE(S) will return a random permutation of the string S. 


         DEFINE('RPERMUTE(S)T')                  :(RPERMUTE_END)
RPERMUTE S LEN(1) . T =                          :F(RETURN)
         RPERMUTE POS(RANDOM(SIZE(RPERMUTE) + 1) - 1)
+                 = T                            :(RPERMUTE)
RPERMUTE_END

Names referenced    Name        Type       Where defined 
by RPERMUTE:        RANDOM      Function   Program 16.1 
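
The same insertion technique, removing characters one at a time from the front of S and placing each at a random position in the growing result, can be sketched in Python. The function name mirrors the program, but the code is an illustration that uses Python's `random` module in place of RANDOM.

```python
import random


def rpermute(s, rng=random):
    """Random permutation of s built by inserting each character at a random position."""
    result = ""
    for ch in s:
        pos = rng.randint(0, len(result))      # any gap, including both ends
        result = result[:pos] + ch + result[pos:]
    return result


print(rpermute("SNOBOL"))
```

Because each new character is equally likely to land in any gap of the partial result, every arrangement of the characters is equally likely, provided the underlying generator is uniform.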
Program 16.4  ONEWAY 

A one-way cipher is a notion of Needham first introduced in 
published form by Wilkes [1972]. The function ONEWAY(S), where 
S is some string, will return a string the same size as S 
having the property that it would be computationally 
prohibitive to compute S or some other value S' such that: 


ONEWAY(S) = ONEWAY(S') 


That is, even knowing everything about ONEWAY to the extent of 
having a listing of ONEWAY in front of you, it is still im- 
practical to compute the original argument from the output 
obtained. 


One-way ciphers are used in password protection schemes as 
follows. A user types in his password S. The system applies 
ONEWAY(S) to obtain a cipher C. C is then looked up in a 
table. If a match is found the user is identified and ap- 
propriate privileges are assumed. This protects against 
accidental or malicious revelation of the table's contents. 
That is, if one, or even all, such ciphers were revealed it 
would not help a thief. He must know the original password or 
any password that would yield the same cipher as the original, 
but this he presumably cannot obtain. 


Without such a protection scheme, the collection of passwords 
is always in jeopardy. In one instance, the message of the 
day for a time-sharing system that will go nameless became, 
quite by accident, the list of passwords. As one wag put it, 
the most confidential file in the system suddenly became the 
most public file. 


Other applications of ONEWAY are indicated in the chapter on 
games. 


*
*  ONEWAY(S) will return a one-way cipher of the alphabetic
*  string S.
*
         DEFINE('ONEWAY(S)A,SIZE,C,K,SB')        :(ONEWAY_END)
*
*  Entry point: Initialize the random number generator (by
*  setting RAN_VAR) and set the alphabet A.  The length of A
*  must be a power (PWR) of 2.
*
ONEWAY   RAN_VAR = 1
         A = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ012345'
         PWR = 5
*
*  Now, for each character (C) within (S) determine its
*  position (K) in the alphabet (A).  Obtain K's binary
*  equivalent and append it to the growing string of bits, SB.
*  Also, use K to modify the 'seed' of the random generator.
*
ONEWAY_1 S LEN(1) . C =                          :F(ONEWAY_2)
         A @K C                                  :F(ERROR)
         SB = SB LPAD(BASEB(K,2),PWR,'0')
         RAN_VAR = REMDR(RAN_VAR * 2 ** PWR + K, 414971)
+                                                :(ONEWAY_1)
*
*  Now we replace each '0' by a '01' and each '1' by a '10',
*  randomly permute the string, and extract the first half of
*  it.
*
ONEWAY_2
         RPERMUTE(BLEND(SB,REPLACE(SB,'01','10')))
+                 LEN(SIZE(SB)) . SB
*
*  Now repack the string from its 1-0 form into something
*  more amenable.
*
ONEWAY_3 SB LEN(PWR) . S =                       :F(RETURN)
         A POS(BASE10(S,2)) LEN(1) . C
         ONEWAY = ONEWAY C                       :(ONEWAY_3)
ONEWAY_END

Names referenced    Name        Type       Where defined 
by ONEWAY:          LPAD        Function   Program 3.2 
                    BASEB       Function   Program 2.4 
                    RPERMUTE    Function   Program 16.3 
                    BASE10      Function   Program 2.5 
                    BLEND       Function   Program 3.7 
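
The steps of ONEWAY can be followed in a short Python sketch. It is an illustration only: the definition of RANDOM (Program 16.1) is not reproduced here, so the multiplier 4676 paired with the modulus 414971, and the way a position is derived from the generator state, are assumptions. The output will therefore not agree with that of the SNOBOL4 program.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ012345"   # 32 = 2**5 symbols
PWR = 5
MODULUS, MULTIPLIER = 414971, 4676              # assumed generator parameters


def oneway(s):
    """Sketch of a one-way cipher in the style of ONEWAY (not the book's exact output)."""
    state = 1
    bits = ""
    for ch in s:
        k = ALPHABET.index(ch)                  # ValueError for characters outside A
        bits += f"{k:0{PWR}b}"                  # K's binary equivalent, PWR digits
        state = (state * 2 ** PWR + k) % MODULUS
    # Each 0 becomes 01 and each 1 becomes 10, doubling the length.
    doubled = "".join("01" if b == "0" else "10" for b in bits)
    permuted = ""
    for b in doubled:                           # insertion scheme as in RPERMUTE
        state = (state * MULTIPLIER) % MODULUS
        pos = state * (len(permuted) + 1) // MODULUS
        permuted = permuted[:pos] + b + permuted[pos:]
    half = permuted[:len(bits)]                 # keep the first half only
    return "".join(ALPHABET[int(half[i:i + PWR], 2)]
                   for i in range(0, len(half), PWR))


print(oneway("SECRET"))
```

Discarding half of the permuted bits is what prevents the argument from being read back directly; the result has the same length as the argument and depends on every character of it through the seed.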
Epilogue 


How difficult is it to break the cipher? No one knows. There 
is no guarantee that someone will not come up with an 
algorithm to quickly find the inverse of ONEWAY; it is just not 
very likely. 


Essentially the initial argument regarded as a bit string is 
both used to 'seed' a random generator and is permuted by the 
generator. The straightforward way of cracking the cipher is 
to assume a final value for the generator and work RPERMUTE in 
reverse by running RANDOM in reverse. If the results are found 
to agree, the cipher is cracked. This points up a weakness of 
ONEWAY as presented here. We normally wish the number of 
guesses required to be of the order of the number of 
combinations of the original string. If this were the case, longer 
passwords would prove to be more difficult to discover. But 
the number of different modes of operation for RANDOM is 
relatively small (414970). Hence, if added security is wanted, 
a generator with a longer cycle time (such as RAMM) should be 
used. Even so, the computation required to permute a half 
million strings in the manner indicated is sufficiently 
formidable that the writer is confident that no one will discover 
the original string used to produce: 


'BFDDGL' 


Of course, other techniques can be used to produce one-way 
ciphers. See Evans, et al [1974] and Purdy [1974]. 


Program 16.5  RCHAR 

RCHAR(CONTEXT) will return a random character. The intended 
sample space is the set of all characters following the 
CONTEXT provided as argument. For example, RCHAR('BR') will 
return 'A' much more frequently than, say, 'B' because 'A' is 
much more likely to follow the characters 'BR'. 


In order to write RCHAR we could pump it full of statistical 
information concerning the English language. A more flexible 
(and easier) approach is to let the user supply his own 
language sample (called the corpus) and use pattern matching 
to search for a likely subsequent character. In this way we 
do not limit ourselves to English nor, indeed, even to natural 
languages. 


To obtain a likely successor to, say, 'BR' within a language 
corpus, we may look up each occurrence of 'BR' and choose 
randomly from among the successors. Another approach is, starting 
at some random point within the string, to scan for the first 
occurrence of 'BR' and then return the character which fol- 
lows. This latter technique is much faster than the former, 
but will produce statistically incorrect results. Thus, if 
the corpus is 1000 characters long, and if 'BR' occurs three 


times in positions 500, 510 and 910, then the random probe and 
forward scan would mean that the 500 or the 910 would be 
picked up relatively frequently, but that the 510 would have 
an extremely small chance of being selected. 


A compromise between these two choices is to scan the string 
for the first K instances of the CONTEXT and to choose a ran- 
dom character from among the K characters which followed. This 
greatly reduces the time required to process CONTEXT's which 
occur frequently, such as RCHAR('E'), while maintaining good 
statistics for other kinds of CONTEXT's. The encoding of RCHAR 
given below will use a compromising value for K of 2. 
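
The compromise can be sketched in Python as follows. This is an illustration of the idea (scan from a random point, collect the successors of the first K occurrences, wrap to the start if necessary), not a transcription of the SNOBOL4 program; the function signature is hypothetical and the corpus is assumed to be nonempty.

```python
import random


def rchar(corpus, context, k=2, rng=random):
    """Pick a successor of `context` from the first k occurrences found
    scanning from a random point; None if the context has no successor."""
    candidates = []
    pos = rng.randrange(len(corpus))            # random starting point
    wrapped = False
    while len(candidates) < k:
        hit = corpus.find(context, pos)
        if hit == -1 or hit + len(context) >= len(corpus):
            if wrapped:                         # already rescanned from the start
                break
            wrapped, pos = True, 0
            continue
        candidates.append(corpus[hit + len(context)])
        pos = hit + 1
    return rng.choice(candidates) if candidates else None


print(rchar("ABRACADABRA BRAVO BRIM ", "BR"))
```

With k = 1 this reduces to the fast but statistically skewed random probe; larger values of k approach the exhaustive method at increasing cost.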


*
*  RCHAR will return a random character following the CONTEXT
*  given as argument.  If none such exists, RCHAR will fail.
*
         DEFINE('RCHAR(CONTEXT)BX,C,P,N,RC1')
*
*  Initialization: Read into R_CORPUS the language corpus on
*  which the statistical characteristics of RCHAR will be
*  based.
*
RCHAR_1  X = TRIM(INPUT)                         :F(RCHAR_END)
         IDENT(X,'END')                          :S(RCHAR_END)
         R_CORPUS = R_CORPUS X ' '               :(RCHAR_1)
*
*  Entry point: Prepare in P a pattern suitable for scanning
*  the text beginning at cursor position N looking for
*  CONTEXT.  BREAKX is used to make the scan rapid.
*
RCHAR    CONTEXT LEN(1) . C                      :F(RCHAR_2)
         BX = BREAKX(C)
RCHAR_2  P = POS(0) TAB(*N) BX CONTEXT LEN(1) . RCHAR
*
*  Pick up the first random character fitting the context.
*  Scanning begins at some arbitrary point N.
*
         N = RANDOM(SIZE(R_CORPUS)) - 1
         R_CORPUS P                              :S(RCHAR_3)
         N = 0
         R_CORPUS P                              :F(FRETURN)
*
*  Here to pick up the next adjacent random character.  The
*  first is saved in RC1.
*
RCHAR_3  N = N + 1
         RC1 = RCHAR
         R_CORPUS P                              :S(RCHAR_4)
         N = 0
         R_CORPUS P
*
*  Here to select from between these two.
*
RCHAR_4  RCHAR = EQ(RANDOM(2),1) RC1             :(RETURN)
RCHAR_END

Names referenced    Name        Type       Where defined 
by RCHAR:           RANDOM      Function   Program 16.1 
                    BREAKX      Function   Program 8.2 
Program 16.6  RWORD 

RWORD is an obvious application of RCHAR. RWORD(K) will return 
a random word with characteristics similar to other words in 
the given corpus. K is a small whole number indicating the 
extent to which context is used in forming the result. That 
is, the next character chosen depends on at most the last K 
characters already chosen. Selection begins with RWORD 
'seeded' with a blank. 


Table 16.3  Below is a list of random names produced by 
RWORD(K) from a list of 700 names (R_CORPUS in RCHAR). Words 
chosen were in the range of 5 - 10 characters but were 
otherwise not preselected. 

  K = 0         K = 1         K = 2         K = 3 
  ----------------------------------------------------- 
  Rnztn         Faundobr      Joher         Alton 
  Eebfer        Einakicl      Thelmsti      Vigan 
  Uoaer         Kolin         Gringtock     Young 
  Earlho        Fssmched      Clouth        Rosen 
  Meeofr        Paubin        Mcdorg        Haekstra 
  Asnegrmnmh    Mormer        Jordawm       Repsherty 
  Ckwaig        Feymet        Paudelly      Haekstraun 
  Kninhaaf      Madicos       Franic        Walton 
  Agajfoope     Halitun       Cloobs        Bartoliti 
  Hfhclunc      Mchoskyr      Panscher      Thatchek 
  Usirollbh     Ralmrollan    Thaman        Caseyman 
  EEdhmeucc     Ffrrr         Mowski        Walker 
  Lasdctn       Linestz       Spaglema      Lopiparo 
  Ghsiafee      Reawstz       Loobs         Shallisi 
  Riesl         Gelllar       Eiter         Ruscher 


Table 16.3 contains a number of random words generated by 
RWORD when RWORD was given a corpus of 700 surnames culled 
from an addressing list. One can see clearly the effects of 
increasing K as well as the influence of the type of corpus 
chosen. The names for K=2, for example, would be quite 
acceptable in outer galactic society. RWORD, using a different 
corpus, could be used for brand-name generation. The name 
EXXON was purportedly chosen in this way. 


         DEFINE('RWORD(K)CONTEXT')               :(RWORD_END)
*
*  Entry point: Initialize RWORD with a blank.
*
RWORD    RWORD = ' '
*
*  Use the last K characters of RWORD (or all of RWORD if it
*  fails to contain K characters) as context for the next
*  character.
*
RWORD_1  CONTEXT = RWORD
         RWORD RTAB(K) REM . CONTEXT
         C = RCHAR(CONTEXT)                      :F(RETURN)
         RWORD = DIFFER(C,' ') RWORD C           :S(RWORD_1)
*
*  Falling through means we encountered a blank.  Remove the
*  initial blank from RWORD.  If RWORD is null, try again.
*
         RWORD ' ' =
         IDENT(RWORD)                            :S(RWORD)F(RETURN)
RWORD_END

Names referenced    Name        Type       Where defined 
by RWORD:           RCHAR       Function   Program 16.5 
Program 16.7  RSELECT 

RSELECT will make a random selection of one of a sequence of 
strings passed to RSELECT as argument. The first character is 
taken to be a break character (BC) separating strings in the 
sequence. Thus, RSELECT('|A|BIG|CAT') will return each of 
'A', 'BIG' and 'CAT' with probability one-third. An optional 
integer weight enclosed in sharp signs may be placed at the 
beginning of any alternation. Thus, 


RSELECT('|A|#3#BIG|CAT') 

will select 'BIG' three times out of five. 


RSELECT will be used as a utility routine by several programs 
which follow. 
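
The weighted selection can be sketched in Python. This is an illustration, not the SNOBOL4 program: the function names are hypothetical, weights are restricted to digits, and `functools.lru_cache` is used as a stand-in for the idea of doing the parsing work only once per distinct argument.

```python
import random
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def parse_alternatives(spec):
    """Split '|A|#3#BIG|CAT' into (('A', 1), ('BIG', 3), ('CAT', 1))."""
    separator, body = spec[0], spec[1:]
    parsed = []
    for alt in body.split(separator):
        match = re.match(r"#(\d+)#", alt)
        if match:
            parsed.append((alt[match.end():], int(match.group(1))))
        else:
            parsed.append((alt, 1))             # default weight of 1
    return tuple(parsed)


def rselect(spec, rng=random):
    alternatives = parse_alternatives(spec)
    total = sum(weight for _, weight in alternatives)
    draw = rng.randint(1, total)                # a number in 1..total
    running = 0
    for text, weight in alternatives:
        running += weight                       # subtotal of weights so far
        if draw <= running:
            return text


print(rselect("|A|#3#BIG|CAT"))
```

Comparing one random draw against the running subtotals gives each alternative a share of the range 1..total equal to its weight.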


         DEFINE('RSELECT(S)WT,WTS,ALT,CODE,I,SSAVED,BC')
         RSEL_TBL = TABLE()                      :(RSELECT_END)
*
*  Entry point: All previously-seen arguments had been placed
*  in a table (RSEL_TBL) together with code to be executed.
*  In this case we simply execute the code.
*
RSELECT  CODE = RSEL_TBL<S>
         DIFFER(CODE,NULL)                       :S<CODE>
*
*  If S had not been seen before, we fall through here.  We
*  first save the string (SSAVED) and determine the break
*  character (BC).  For each alternate (ALT), its weight (WT)
*  is determined and added to a subtotal (WTS).  CODE is
*  produced which will assign the alternative to RSELECT if
*  the numbers are right.
*
         SSAVED = S
         S LEN(1) . BC =                         :F(RETURN)
RSELECT_1
         WT = 1
         S POS(0) '#' BREAK('#') . WT '#' =
         S (BREAK(BC) | REM) . ALT =
         WTS = WTS + WT
         CODE = CODE ' ; RSELECT = LE(I,' WTS ') '
+               QUOTE(ALT) ' :S(RETURN) '
         S BC =                                  :S(RSELECT_1)
*
*  Falling through means we're done.  We simply prefix the
*  code to assign a random number to I, fill the table and
*  try again.
*
         CODE = ' I = RANDOM(' WTS ') ' CODE
         S = SSAVED
         RSEL_TBL<S> = CODE(CODE)                :S(RSELECT)F(ERROR)
RSELECT_END

Names referenced    Name        Type       Where defined 
by RSELECT:         QUOTE       Function   Program 3.16 
                    RANDOM      Function   Program 16.1 

Epilogue 


An interesting implementation aspect of RSELECT is that it 
compiles code the first time through for any given argument. 
This makes sense for a random generator since it may be called 
many times with the same argument and compiling code, as shown 
here, greatly increases the speed of subsequent calls. 
Moreover, the program is not made very much more complicated 
because of this; in fact, the construction of CODE actually 
saves a second pass over the string and in this sense serves 
to produce a simpler program. If space is a greater 
consideration than time, see Exercise 16.5. 


Program 16.8  RSENTENCE 

RSENTENCE(ARG) will generate and return a random sentence 
according to a grammatical description read in during 
initialization. The argument ARG represents a string possibly 
containing syntactic variables which are expanded according 
to the grammar. As a simple example, let the input be 


<SENT>::=the <NOUN> <VERB>s the <NOUN> 
<NOUN>::=boy|man|dog|<NOUN> who <VERB>s the <NOUN> 
<VERB>::=bite|walk|pet|lick|smack 
END 


Then a call such as RSENTENCE('<SENT>.') will generate, among 
an infinite number of sentences, 


the dog bites the man. 

the man walks the dog. 

the man who walks the dog who licks the boy smacks the boy 
who bites the dog. 


Identifiers in pointed brackets (here shown in uppercase for 
ease of distinction) are termed syntactic variables. Alter- 
nates are separated by vertical bar (|). Though these special 
characters may not appear within the text it is not difficult 
to provide an escape convention so that they can be (See Exer- 


cise 16.9). 


When a syntactic variable is expanded it is replaced by one of 
its alternates randomly and this alternate may in turn contain 
other syntactic variables which are also expanded. This 
process may never halt (see the Epilogue). 
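
The expansion of syntactic variables can be sketched in Python. The sketch handles syntactic variables only (not parenthesized expressions, assignments or weights), keeps the grammar in a dictionary rather than reading it from input, and uses hypothetical names throughout.

```python
import random
import re

GRAMMAR = {
    "SENT": ["the <NOUN> <VERB>s the <NOUN>"],
    "NOUN": ["boy", "man", "dog", "<NOUN> who <VERB>s the <NOUN>"],
    "VERB": ["bite", "walk", "pet", "lick", "smack"],
}

VARIABLE = re.compile(r"<([A-Z]+)>")


def rsentence(text, grammar=GRAMMAR, rng=random):
    """Replace the leftmost <VARIABLE> by a random alternate until none remain."""
    while True:
        match = VARIABLE.search(text)
        if match is None:
            return text
        replacement = rng.choice(grammar[match.group(1)])
        text = text[:match.start()] + replacement + text[match.end():]


print(rsentence("<SENT>.", rng=random.Random(0)))
```

With this particular grammar a <NOUN> terminates three times out of four, so the expansion halts with probability 1; a grammar weighted the other way may not.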


The meta-language used for describing the grammar is the so- 
called Backus Normal Form (BNF) which is also referred to as 
Backus-Naur Form since the form is not normal (non unique) and 
since Naur was a cohort of Backus. The meta-language is a bit 
awkward (the first four meta-characters are redundant provided 
syntactic variables do not contain ='s) but has the convenient 
property of being commonly understood. 


Another feature of RSENTENCE is that an expression in paren- 
theses is treated as a SNOBOL4 expression. It is evaluated 
and inserted into the text stream. Also, an identifier between 
-'s is expanded like a syntactic variable but will also have 
the side-effect of assigning the result of the expansion to 
the indicated variable. Thus 


<THING>::=rose|tree|turkey 
<SENT1>::= A =THING= is a (THING) is a (THING). 
<SENT2>::= The word '=THING=' has (SIZE(THING)) letters. 


will produce for <SENT1>: 
A rose is a rose is a rose. 

with probability one-third. An example of <SENT2> is 

The word 'turkey' has 6 letters. 

Other miscellaneous features of the program are as follows. 
Continuation is represented by a line not beginning with a 
'<'. Weights can be associated with alternation using the 
#int# notation of RSELECT. 

One application of RSENTENCE is test-data generation for 
compilers and other processors expecting stylized input (an early 
version of RSENTENCE was used to find bugs in SNOBOL4 itself). 
Another application is in producing nonrepetitive messages in 
an interactive environment. For example, in game playing, a 
variety of sarcastic remarks can provoke an otherwise 
apathetic player into a competitive state. RSENTENCE has been 
used in the production of prospective topics for a discussion 
group. While not all topics randomly generated are directly 
usable, they are often sufficiently suggestive and suf- 
ficiently numerous that random generation followed by a cul- 
ling process, such as the previously described brand-name 
selection, becomes an effective technique. 


Yngve [1962a] suggests that such programs coupled with a full 
and valid grammar, solve one aspect of the problem of machine 
translation, viz. the target-language generation end. One must 
realize, however, that RSENTENCE, by itself, is limited almost 
exclusively to context-free generations and hence to very 
restrictive grammars. To aid in the machine translation study, 
RSENTENCE must be considerably enhanced. One such enhancement, 
suggested by Yngve is given in Exercise 16.8. It must also be 
realized that it is not sufficient merely to generate 
sentences having a variety of syntactic constructs; one must 
actually be able to perform transformations from one form into 
another. This is considered more fully in RSTORY (Prog. 
16.11). 


         DEFINE('RSENTENCE(STACK)VAR,EXP,S,TEXT')
*
*  Pattern initialization:
*
         SYN.VAR = POS(0) '<' ARB . VAR '>'
         SNOBOL.EXP = POS(0) '(' BAL('(<>)','"' "'") . EXP ')'
         ASGN.VAR = POS(0) '=' ARB . VAR '='
         LITERAL.TEXT = BREAK('<=(') . TEXT
*
*  Read in the grammar and enter the alternative lists into a
*  table (RSENT_TBL).
*
         RSENT_TBL = TABLE()
         SS = TRIM(INPUT)
RSI_1    S = TRIM(INPUT)
         S POS(0) ('<' | 'END' RPOS(0))          :S(RSI_2)
         SS = SS S                               :(RSI_1)
RSI_2    SS '<' ARB . NM '>::=' =
         RSENT_TBL<NM> = '|' SS
         IDENT(S,'END')                          :S(RSENTENCE_END)
         SS = S                                  :(RSI_1)
*
*  Entry point: The string named STACK will contain all not-
*  yet processed information.  The string S will contain the
*  random sentence being formed.  We examine the STACK for a
*  syntactic variable, a SNOBOL4 expression in parenthesis,
*  an assignment operation enclosed in ='s, or, if none of
*  these, arbitrary text.
*
RSENTENCE
         STACK SYN.VAR = RSELECT(RSENT_TBL<VAR>) :S(RSENTENCE)
         STACK SNOBOL.EXP =                      :F(RSENT_1)
         S = S EVAL(EXP)                         :(RSENTENCE)
RSENT_1  STACK ASGN.VAR =                        :F(RSENT_2)
         $VAR = RSENTENCE('<' VAR '>')
         S = S $VAR                              :(RSENTENCE)
RSENT_2  STACK LITERAL.TEXT =                    :F(RSENT_3)
         S = S TEXT                              :(RSENTENCE)
RSENT_3  RSENTENCE = S STACK                     :(RETURN)
RSENTENCE_END

Names referenced    Name        Type       Where defined 
by RSENTENCE:       BAL *       Function   Program 8.3 
                    RSELECT     Function   Program 16.7 

* indicates name is referenced in the initialization section. 


Epilogue 


A curiosity of sentence generators such as RSENTENCE is that 
it is possible to write a grammar with a chance of looping 
forever. Pohl [1967] gives the following examples: 


<S1>::= A | B <S1> 
<S2>::= A | <S2> A <S2> | <S2> B <S2> 
<S3>::=#2# A | <S3> A <S3> | <S3> B <S3> 


Whereas <S1> will always halt, <S2> has only a probability of 
1/2 of halting (unlike normal loops, the program will not ac- 
tually run forever because storage requirements will 
ultimately be exceeded; in practice, however, the program will 
appear to be looping because the storage growth rate is 
small). <S3> represents a 'fixed-up' version of <S2> which, 
like <S1>, will halt with probability 1. 


The analysis of this phenomenon is based on the notion of 
random walks with ruin and is treated in detail by Feller [1957]. 
Let a particle on each step move either to the left or to the 
right. Let it move to the left with probability p and to the 
right with probability q so that p + q = 1. Let P be the 
probability of moving one step to the left, ever. Then P**n 
is the probability of ever moving n steps to the left. The 
first step is either to the left (probability p) or to the 
right (probability q), after which two steps to the left must 
be made up. Hence 


P = p + q P**2 


This equation has exactly two solutions, viz. P = 1 and P = 
p/q. Curiously, the correct choice does not seem to be 
deducible by a simple argument. It happens to be 1 if p >= q 
and is p/q if p < q. The dividing line of p = q = 1/2 is of 
interest in that the walk is certain to ultimately reach any 
point but the expected waiting time is infinite. 


In the examples above, <S2> loops because, effectively, q = 
2/3 and p = 1/3. On the other hand <S3> has p = 1/2 and q = 
1/2 and so the probability of halting is 1 (but just barely). 
In <S1>, we may throw out any alternation that leads to the 
same state so that, effectively, p = 1 and q = 0. 
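
The halting probabilities can be checked numerically. In the Python sketch below, each trial tracks the number of unexpanded syntactic variables, which falls by one with probability p and rises by one with probability q; the function names, the trial count and the cap used to declare a walk 'lost' are all assumptions made for the illustration.

```python
import random


def halting_probability(p):
    """Relevant root of P = p + q*P**2: 1 if p >= q, else p/q."""
    q = 1.0 - p
    return 1.0 if p >= q else p / q


def simulate(p, trials=5000, cap=100, rng=None):
    """Fraction of walks that reach zero pending variables before reaching `cap`."""
    rng = rng or random.Random(1)
    halted = 0
    for _ in range(trials):
        pending = 1                          # one syntactic variable to expand
        while 0 < pending < cap:
            pending += -1 if rng.random() < p else 1
        halted += pending == 0
    return halted / trials


print(halting_probability(1 / 3))            # the <S2> case: p = 1/3, q = 2/3
print(simulate(1 / 3))
```

For <S2> the formula gives 1/2, and the simulated fraction should come out close to that value. The borderline case p = 1/2 is not simulated here because, as the text notes, the expected waiting time is infinite.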
Program 16.9  RPOEM 

One use (one hesitates to say application) of RSENTENCE is in 
poetry generation (see Milic [1970, 1971] for a general 
discussion of this topic and other references). For example, 
if the following were the input to RSENTENCE: 


<PROP>::=action|duration|hunger|feeling|activity|movement|
motion|notion|endurance|tenderness|age|taste|bounty|goodness
<GEN>::=time|nature|age|wisdom|war|peace|power|energy|earth|
love|beauty|charity|faith|hope|thought|strength|night|
piety|heart|land|evil
<SPEC>::=flower|tree|dove|star|cloud|twig|pond|dog|goat|
muffin|petal|wagon wheel|gate|trap|lark|raven|drop|dish|spoon|
spark|bone|brain|tooth|face|rake|shovel|book|cover|whistle
<PREP>::=on|up|over|under|within|beside|of|in
<TVERB>::=revere|worship|understand|beseech|control|provoke|
heal|pursue|strengthen|become|kill|arouse|becalm|ensnare
<IVERB>::=sing|talk|run|aspire|twiddle|think|gurgle|ponder|
wiggle|bend|simmer|bask|break|tumble|dance|whistle|squawk
<ADJ>::=gentle|frail|happy|sorrowful|mournful|gay|rusty|
frolicking|wonton|lustful|timid|pensive|timorous|moody
<AUX>::=may|can|shall|should|must|doth
<NOUN>::=a <ADJ> <SPEC>|a <SPEC> of <GEN>|the <PROP> of a
 <SPEC>|the <SPEC> <PREP> <NOUN>|<GEN> <PREP> <GEN>|<GEN>'s
 <PROP>|<ADJ> <GEN>|the <PROP> of <GEN>
<RPOEM>::=A =ADJ= =SPEC= <AUX> <IVERB> <PREP> =NOUN=/And <AUX>
 <TVERB> <NOUN>./But <NOUN> <TVERB>s <NOUN>/While (NOUN)
 <TVERB>s the (ADJ) (SPEC)./
END


The first four calls to RSENTENCE('<RPOEM>') (with RAN_VAR set 
to 1) produce: 


A lustful twig can twiddle up the tenderness of a spoon 
And can kill the motion of wisdom. 
But the brain beside gay power heals the action of earth 
While the tenderness of a spoon heals the lustful twig. 


A happy muffin shall bask under earth of night 
And can ensnare the pond up charity of earth. 
But the activity of charity strengthens sorrowful faith 
While earth of night beseechs the happy muffin. 


A wonton gate may gurgle under the gate of the age of a star 
And should worship a gay shovel. 
But frail wisdom ensnares the endurance of night 
While the gate of the age of a star pursues the wonton gate. 


A moody cloud shall ponder over the motion of a shovel 
And should beseech the goodness of beauty. 
But war over nature worships a wonton goat 
While the motion of a shovel strengthens the moody cloud. 

where the lines are broken at slashes. Notice that an effort
was made to produce sentences which would be syntactically
correct and also have some semantic soundness. For example,
there are three types of nouns, GENeral, SPECific and
PROPerty. One of the noun phrases is <PROP> of <SPEC>, i.e. a
property of a specific thing, but <SPEC> of <PROP> is not
allowed.
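
The expansion of syntactic variables can be sketched outside
SNOBOL4 as well. The following is a minimal Python sketch, not
the RSENTENCE of Program 16.8: it assumes a small excerpt of the
vocabulary above, and the names GRAMMAR and expand are invented
for the illustration. It handles only <VAR> references; the
=VAR= assignments, the (VAR) evaluations and the slash line
breaks are left out.

```python
import random
import re

# A small excerpt of the grammar above; '|' separates alternatives.
# (Assumption: this excerpt is non-recursive, so expansion always halts.)
GRAMMAR = {
    "ADJ": "gentle|frail|happy|moody",
    "SPEC": "flower|tree|dove|star",
    "GEN": "time|nature|wisdom|peace",
    "PROP": "action|motion|tenderness",
    "NOUN": "a <ADJ> <SPEC>|a <SPEC> of <GEN>|the <PROP> of <GEN>",
}

VAR = re.compile(r"<([A-Z_]+)>")


def expand(text, rng):
    """Replace every <VAR> in text by a randomly chosen alternative."""
    def pick(match):
        alternatives = GRAMMAR[match.group(1)].split("|")
        # An alternative may itself contain variables, so recurse.
        return expand(rng.choice(alternatives), rng)
    return VAR.sub(pick, text)


if __name__ == "__main__":
    rng = random.Random(1)
    for _ in range(3):
        print(expand("<NOUN>", rng))
```

Because <NOUN> here never offers "<SPEC> of <PROP>" as an
alternative, that phrase cannot be produced, which is how the
grammar itself enforces the restriction.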


One reason that the random generation of poems has been
popular is that context-free generators produce very little
semantic connectivity between words. Since the poet is granted
license to break such rules we naturally interpret text in
which such rules are broken as poetry. As Milic [1970] has
observed, we readily "... accept metaphor as an alternative to
calling a sentence nonsensical." Hence, in generating random
text it is much easier to randomly generate 'poetry' than
prose just as it is easier to randomly generate 'abstract art'
than good pictures. One conceivable application of random
poetry is as an initial exercise in a poetry-appreciation
course. The exercise of explaining the 'meanings' of some of
the computer renderings can be a mind-expanding experience.


RSENTENCE may, as we will see, also be used for story
generation. There are, however, definite limitations in this
direction. Mendoza [1968] describes one effort to improve
somewhat on the semantic soundness of the generated sentences.
Essentially his method applied weights to different noun-verb
combinations so that a squirrel would munch and crunch with a
greater likelihood than crawl and swim. This technique
produced sentences which were internally sound but which had
very little relation to other sentences. Hence, when Mendoza
read sets of such sentences to his children as stories, the
children complained because the stories never got anywhere.
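
Weighted selection of this kind is easy to sketch. The Python
fragment below is an illustration only, not Mendoza's program:
the weights and the names WEIGHTS and choose_verb are
hypothetical, chosen so that a squirrel munches or crunches
five times as readily as it crawls or swims.

```python
import random

# Hypothetical weights: a larger number means the verb suits the noun better.
WEIGHTS = {
    "squirrel": {"munch": 5, "crunch": 5, "crawl": 1, "swim": 1},
}


def choose_verb(noun, rng):
    """Pick a verb for noun with probability proportional to its weight."""
    verbs = list(WEIGHTS[noun])
    weights = [WEIGHTS[noun][verb] for verb in verbs]
    return rng.choices(verbs, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(1)
    print("The squirrel will", choose_verb("squirrel", rng))
```

Each sentence is still drawn independently of the last, so the
weights improve soundness within a sentence but do nothing for
continuity between sentences.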


Using a vocabulary heavily sprinkled with chemical terms,
Mendoza reported on attempts to pass off randomly-generated
sentences in a chemistry examination. It is perhaps a plus for
higher education that the teacher not only did not give a high
grade to the computer but actually stormed into the Director's
office shouting "Who the hell is this man - why did we ever
admit him?" Perhaps what is of interest in these stories is
that the individuals involved did not see the computer behind
the gibberish but accepted it as very bad human products. This
is an advance of sorts. The problem of providing inter-sentence
connectivity is a challenging one and will be considered after
taking up the next topic.

SIMULATION

The computer may be used to simulate real events and, in so
doing, may determine the outcome of certain strategies or
actions far less expensively and more quickly than by
concocting the event physically. Simulation is used where the
events to be predicted are not amenable to mathematical
analysis but where the underlying stochastic structure is
well-established. Simulations are used in business where
transport networks, factories and shops, trading centers, etc.
may be analyzed, in the study of warfare, cities, traffic,
demography, biological adaptation and many other large and
complex situations. Simulations are sometimes referred to as
Monte Carlo techniques, but this latter term is more likely to
be reserved for more mathematically-oriented situations. As a
crude example, the area under a curve can be approximated by
generating random number pairs (See Exercise 16.13) and testing
to see if they fall above or below the curve of interest. Other
areas where simulations can be used are game-playing, sports
and gambling. For a specific simulation we choose the game of
baseball.
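
The area-under-a-curve idea can be sketched in a few lines. The
Python fragment below is an illustration under stated
assumptions: the curve is taken to be a quarter circle (not the
curve of Exercise 16.13), it must lie between 0 and 1 on the
interval [0,1), and the function names are invented here.

```python
import random


def area_under_curve(f, samples, rng):
    """Estimate the area under y = f(x) on [0,1), assuming 0 <= f(x) <= 1."""
    below = 0
    for _ in range(samples):
        # One random pair (x, y) in the unit square per sample.
        x, y = rng.random(), rng.random()
        if y < f(x):
            below += 1
    # The fraction of pairs falling below the curve approximates the area.
    return below / samples


def quarter_circle(x):
    """Upper-right quarter of the unit circle; its area is pi/4."""
    return (1.0 - x * x) ** 0.5


if __name__ == "__main__":
    print(area_under_curve(quarter_circle, 20000, random.Random(7)))
```

The estimate wanders around pi/4 (about 0.785) and tightens
only slowly as the number of samples grows.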


Program 16.10  RSEASON

The function RSEASON(NG) is intended to simulate a random
season of baseball. The number of games is given by the
argument NG. The value returned is the number of runs scored in
the simulation. The simulation is governed by statistics read
in at initialization time. One example of input that could be
given is shown in Table 16.4.


Table 16.4  Shows the line-up and statistics for the 1927 New
York Yankees. Source is BB [1969]. Only the data shown in lower
center was actually input to RSEASON.

     Name    |  AB    H  DB  TR  HR   BB | BA
     --------+---------------------------+------
     Combs   | 648, 231, 36, 23,  6,  62 | .356
     Koenig  | 526, 150, 20, 11,  3,  25 | .285
     Ruth    | 540, 192, 29,  8, 60, 138 | .356
     Gehrig  | 584, 218, 52, 18, 47, 109 | .374
     Meusel  | 516, 174, 47,  9,  8,  45 | .337
     Lazzeri | 570, 176, 29,  8, 18,  69 | .309
     Dugan   | 387, 104, 24,  3,  2,  27 | .269
     Collins | 251,  69,  9,  3,  7,  54 | .275
     Pitcher | 500,  50,  5,  1,  2,  10 | .100


Table 16.4 shows the lineup and statistics of the 1927 New
York Yankees, perhaps the most powerful hitting aggregation in
the history of baseball. The statistics given for the pitcher
are not those of any given player but are an estimated
composite of the entire pitching staff.


The program is in a sense the simplest possible simulation
since only offensive data are given for only one team. A
perfect simulation would perhaps require that every blade of
grass be taken into account and is completely out of the
question from the standpoint of human effort let alone the fact
that baseball records, complete as they are, do not show all
such minutiae. Between these extremes, the pitcher on the
defensive team and to a lesser extent the fielders do affect
the performance of the offensive team as a whole and may
peculiarly affect individual hitters. Another weakness of the
simulation is that every player's performance is independent
of his previous performances and, more severely, of the game
situation. Some players are considered 'clutch hitters' and
pitchers tend to 'bear down' on hitters in tight situations.
All of these factors are worth a study of their own to anyone
interested in a serious simulation of the game. We will be
content with exploring the principles of simulation. As it
stands, however, RSEASON could be used to determine the gross
effects due to line-up changes and permutations in order to
determine optimal line-ups or to evaluate trades, the effect
of pinch hitters, etc.


          DEFINE('RSEASON(GAMES)INNING,RUNS,BASES,OUTS,K')

* A structure, RECORD, is defined to contain the statistics
* of one player. STATS is an array, filled during the
* initialization period with statistics of the players in
* the simulated lineup.

          DATA('RECORD(AB,H,DB,TR,HR,BB)')
          STATS = ARRAY(9)
          I = 0
RS_INIT   I = I + 1
          STATS<I> = EVAL('RECORD(' INPUT ')')      :S(RS_INIT)
                                                    :(RSEASON_END)

* Entry point and outer loop: Control returns here after
* each complete game. Control arrives at RS_1 for each new
* inning. BASES will contain the men on base in the form of
* a string and OUTS is an integer recording the number of
* outs.

RSEASON   GAMES = GT(GAMES,0) GAMES - 1             :F(RETURN)
          BATTER = 0
RS_1      OUTS = 0
          BASES =

* Here for each new batter. His statistics are obtained in
* S. A random number K is obtained based on his total at-bats.
* The variable ADV is set according to how his performance
* would advance runners from bases 0, 1, 2, and 3.
* The actual advancement is done at RS_4. An exception is
* the walk (BB) in which advancement is context sensitive
* and so must be treated as a special case at RS_BB (a walk
* with the bases loaded forces in a run, recorded as 'R').

RS_2      BATTER = EQ(BATTER,9) 0
          BATTER = BATTER + 1
          S = STATS<BATTER>
          K = RANDOM(AB(S) + BB(S))
          ADV = GT(K,AB(S)) '1223'                  :S(RS_BB)
          OUTS = GT(K,H(S)) OUTS + 1                :S(RS_OUT)
          ADV = LE(K,HR(S)) 'RRRR'                  :S(RS_4)
          ADV = LE(K,HR(S) + TR(S)) '3RRR'          :S(RS_4)
          ADV = LE(K,HR(S) + TR(S) + DB(S)) '23RR'  :S(RS_4)
          ADV = '12RR'
RS_4      BASES = REPLACE(BASES 0, '0123', ADV)     :(RS_2)
RS_BB     BASES '321' = 'R21'
          BASES '21' = '31'                         :(RS_4)

* If there are now three outs, determine the number of RUNS
* scored this inning by scanning BASES. Add to total
* (RSEASON). Then check to see if we've completed 9 INNINGS.

RS_OUT    EQ(OUTS,3)                                :F(RS_2)
          RUNS = 0
          BASES SPAN('R') @RUNS
          RSEASON = RSEASON + RUNS
          INNING = INNING + 1 LT(INNING,9)          :S(RS_1)
          INNING = 0                                :(RSEASON)
RSEASON_END

Names referenced     Name       Type        Where defined
by RSEASON:          RANDOM     Function    Program 16.1
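
The heart of RSEASON is the way one random draw K selects an
outcome and a four-character string then moves the runners. The
Python sketch below restates just that step for readers who do
not follow the SNOBOL4; it is not a port of the whole program.
It assumes that RANDOM(N) of Program 16.1 returns an integer
from 1 to N and that a bases-loaded walk scores the runner from
third. The names outcome, advance and runs_scored are invented
for the illustration.

```python
# (AB, H, DB, TR, HR, BB) for one batter: Ruth's row of Table 16.4.
RUTH = (540, 192, 29, 8, 60, 138)


def outcome(k, stats):
    """Classify a draw k in 1..AB+BB in the order the tests after RS_2 use."""
    ab, h, db, tr, hr, bb = stats
    if k > ab:
        return "walk"
    if k > h:
        return "out"
    if k <= hr:
        return "homer"
    if k <= hr + tr:
        return "triple"
    if k <= hr + tr + db:
        return "double"
    return "single"


# Where a runner on base 0 (the batter), 1, 2 or 3 ends up; 'R' is a run.
ADVANCE = {
    "homer": "RRRR",
    "triple": "3RRR",
    "double": "23RR",
    "single": "12RR",
    "walk": "1223",
}


def advance(bases, result):
    """Return the new BASES string after a hit or walk."""
    if result == "walk":
        # Only forced runners move: mirror the two replacements at RS_BB.
        bases = bases.replace("321", "R21", 1)
        bases = bases.replace("21", "31", 1)
    table = str.maketrans("0123", ADVANCE[result])
    # The batter joins the string as runner '0' before everyone advances.
    return (bases + "0").translate(table)


def runs_scored(bases):
    """Count the leading 'R' characters, as SPAN('R') @RUNS does."""
    return len(bases) - len(bases.lstrip("R"))
```

For instance, with runners on first and second ('21') a single
gives 'R21': the man on second scores and the other two occupy
second and first.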


One of the most important aspects of a simulation is how to 
interpret the numbers. For example, to simulate a season we 
may call RSEASON(154) and find that 978 runs were scored. But 
repeated calls to RSEASON(154) will produce slightly different 
numbers. An actual sequence obtained was: 


978 1013 1068 1004 886 999 1053 1039 


These eight numbers average to 1005. In general, the more
numbers we obtain the closer their average approaches some
limiting value. Since computation can be expensive and
time-consuming, we may well ask how far we must pursue the
statistic-gathering before the average settles down to
something reasonable. Said another way, how can we estimate
the error of such a computed average?


Let M be the mean of n numbers $X_1, X_2, \ldots, X_n$. That is

$$M = (X_1 + X_2 + \cdots + X_n)/n \qquad (16.1)$$

It is well known [Feller 1957] that if the $X_1, X_2, \ldots, X_n$
are independent then no matter what their distribution (assuming
their means and variances are not infinite), their sum S

$$S = X_1 + X_2 + \cdots + X_n$$

approaches a Gaussian distribution whose standard deviation
(or standard error) E can easily be estimated from the
formula:

$$E^2 = (X_1 - M)^2 + (X_2 - M)^2 + \cdots + (X_n - M)^2 \qquad (16.2)$$

The sum S will be in error by about E. Moreover, we may be
95% confident that S is within $\pm 2E$ from the average value.
Hence we may with the same confidence (95%) expect that the
asymptotic average will be in the range:

$$S/n \pm 2E/n$$


As an example, given the previous 8 numbers, we obtain

$$E^2 = 729 + 64 + 3969 + 1 + 14161 + 36 + 2304 + 1156 = 22420$$

$$E = 150$$

$$S/n \pm 2E/n = 1005 \pm 37.5$$


For long sequences of numbers, (16.2) is not in the most
convenient form, since the mean M is not available until the
last number $X_n$ is seen. Rewriting (16.2) using (16.1) we
obtain:

$$E^2 = (X_1^2 + X_2^2 + \cdots + X_n^2) - nM^2 \qquad (16.3)$$

Note that $E^2$ varies roughly as n and so E/n varies inversely
as the square root of n. Hence in order to reduce our range
of error by a factor of K we must gather $K^2$ times as many
statistics. Hence, precision is expensive and, for this
reason, simulations are used only when analytical techniques
are not available.
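
The arithmetic of (16.1) through (16.3) can be checked with a
short Python sketch. The function name mean_and_error is
invented here; the eight season totals are those quoted in the
text.

```python
import math

# The eight simulated season totals quoted in the text.
SEASONS = [978, 1013, 1068, 1004, 886, 999, 1053, 1039]


def mean_and_error(xs):
    """Return M, E^2 by (16.2), E^2 by (16.3) and the half-width 2E/n."""
    n = len(xs)
    m = sum(xs) / n                                   # (16.1)
    e2_direct = sum((x - m) ** 2 for x in xs)         # (16.2)
    # (16.3) needs only the running sums of X and X^2.
    e2_running = sum(x * x for x in xs) - n * m * m
    half_width = 2 * math.sqrt(e2_direct) / n
    return m, e2_direct, e2_running, half_width


if __name__ == "__main__":
    print(mean_and_error(SEASONS))
```

Both forms give 22420 for these data. The half-width comes out
near 37.4 rather than 37.5 only because the text rounds E up to
150 before dividing.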


To determine the effect of modifying the batting order,
RSEASON(154) was called 45 times with the lineup as indicated
in Table 16.4 and 45 times with Ruth and the pitcher
interchanged. In the first case the average runs scored per
season was 1009 ± 14 where 14 is the 95% confidence interval.
In the second case the average was 971.5 ± 14. The experiment
clearly shows the efficiency of the given lineup over the
postulated one.


One curiosity remains however. The number of runs the Yankees 
actually scored that season was 975. This in spite of the fact 
that pinch hitters, clutch hitting, extra-inning games, errors 
and better pitcher-hitting than .100 would have made the ac- 
tual figure higher than the simulated figure. On the other 
hand, the Yanks won 110 games that year. If say 70 were won 
at home then they missed one inning out of twenty which would 
account for 50 runs. Almost certainly, good clutch pitching, 
if not choke hitting, could account for the rest. 

Program 16.11  RSTORY

As indicated by Mendoza (Epilogue to RPOEM, Prog. 16.9)
sequences of sentences which bear little coherence one to the
other are not particularly interesting even to children let
alone the flabbergasted professor. At first sight, the
ability to produce an actual story may seem quite beyond the
state of the computer art. However, it is not essentially
difficult to supply the desired connectivity by using some
underlying simulation to form a developing plot and use the
random sentence generator to supply verbal 'sugaring'. This is
amply illustrated by the baseball simulation (RSEASON) which
would be quite easy to modify to produce a 'meat and potatoes'
narration such as: "... Ruth makes out, Gehrig hits single,
Meusel makes out, End of inning, no runs ... ", etc. For the
purpose of story-generation, descriptive phrases, chosen at
random, could further embellish the tale, adding needed color
(See Exercise 16.16).


For the generation of stories which may appeal to children, a 
child's game may be simulated. There are many games on the 
market in which tokens moving over a board carry the child 
through a sequence of adventures often with a competitive ele- 
ment thrown in which would make the story interesting. Board 
games, such as Monopoly, have been programmed and most 
children's games are considerably less complicated than this. 


One method of producing random stories which only vary weakly
from each other is to locally perturb certain variables of a
given pre-concocted story. There are children's books on the
market which utilize this principle in producing personalized
books. In addition to using this principle, RSTORY, below,
attempts to utilize a collection of semantically rich (or at
least richer) information of the form <agent> <adversely
operates upon> <agent>. RSTORY draws upon these relationships
in order to produce a simple 'actor-action' chain which this
classic children's story requires.


* Process phrases - We assume that RSENTENCE has read in all
* syntactic variable definitions. All phrases are of the
* form SUBJECT VERB OBJECT. For each object expressed or
* implied in a phrase, we make an entry in the table ACTIONS
* which will contain the subject and verb.

          ACTIONS = TABLE()
          BB = BREAK(' ')
          SB = SPAN(' ')
READ_PHRASE
          X = TRIM(INPUT)                           :F(BEGIN_STORY)
          IDENT(X,'END')                            :S(BEGIN_STORY)
          X (BB SB BB) . SUBJ_VERB SB REM . OBJS
          OBJS = OBJS '|'
READ_PH1
          OBJS POS(0) '<' ARB . VAR '>' = RSENT_TBL<VAR>
          OBJS POS(0) '|' =                         :S(READ_PH1)
          OBJS BREAK('|') . OBJ '|' =               :F(READ_PHRASE)
          ACTIONS<OBJ> = ACTIONS<OBJ> '|' SUBJ_VERB
                                                    :(READ_PH1)

* The story's setting and the principal characters are
* introduced here.

BEGIN_STORY RSTORY = RSENTENCE('<OPENING>')
          LIST = ' ' PET " won't jump over the " BARRIER
          LAST = PET
          &MAXLNGTH = 30000

* Find a new agent; we will try ten times to produce a verb
* and an agent that we haven't seen before.

NEW_AGENT
          TRY = 0
RETRY     TRY = TRY + 1 LT(TRY,10)                  :F(REQUEST)
          ALTS = ACTIONS<LAST>
          RSENTENCE(RSELECT(ALTS)) BB . SUBJ SB REM . VERB
          RSTORY ' ' SUBJ ' '                       :S(RETRY)
          RSTORY ' ' VERB ' '                       :S(RETRY)

* Here the refusal is added to the story as well as
* descriptive text relating to finding a new agent and making
* a request.

REQUEST   RSTORY = RSTORY RSENTENCE('<REFUSAL>')
          LIST = ' ' SUBJ " won't " VERB ' the ' LAST ',' LIST
          LAST = SUBJ

* If the agent complies freely with the request, control
* falls through the next test and the story is essentially
* over.

          LT(SIZE(LIST),175)                        :S(NEW_AGENT)
FIN1      LIST "won't" = "began to"                 :S(FIN1)
FIN2      LIST ',' = '; the'                        :S(FIN2)
          RSTORY = RSTORY RSENTENCE('<PERSUADED>')

* Now output the story.

OUT       RSTORY (LEN(50) BB) . OUTPUT SB =         :S(OUT)
          OUTPUT = RSTORY

* Below find the input data to the program. The first half
* (up to END) is processed by RSENTENCE. Following this we
* find the phrases on which the story is based.

END
<OPENING>::=<TIME> there was a =CHAR= who went to <PLACE> and
bought a =PET=. On the way home they came upon a =BARRIER=
which the (PET) was afraid to cross. The (CHAR) said "(PET),
(PET), jump over the (BARRIER) or I won't get home tonight."
<TIME>::=Once upon a time|Once|Long ago in a small village|
In days gone by in a little town by the river
<PLACE>::=market|a pet store|a super market|town|the city
<BARRIER>::=fence|ditch|fallen tree|large rock|stream|brook
<PET>::=dog|cat|parrot|pony
<REFUSAL>::= But the (LAST) would not. The (CHAR)
<EXCURSION> and she met a (SUBJ). She said, "(SUBJ), (SUBJ),
(VERB) (LAST), (LIST) and I shan't get home tonight."
<EXCURSION>::=went down the path|went over a hill|went by
<OBJECT> and then <EXCURSION>|went toward <OBJECT>|
went over hill and dale|went near <OBJECT>|went on the road to
<OBJECT>|went for (RANDOM(20) + 1) miles
<OBJECT>::=the <COLOR> <THING>
<COLOR>::=white|blue|red|yellow|grey|black|dark|green|orange
<THING>::=mill|tavern|church|school|house|meadow|rock|barn
<PERSUADED>::= The (SUBJ) knew the (CHAR) and, in fact,
had been saved by her from a wild <WILD_AN>. So the (LIST)
and the (CHAR) got home that night.
<CHAR>::=little old woman|little old lady|kind grandmother|
kind old aunt|little girl dressed in red|retired seamstress|
nice old lady|little girl green
<DOM_AN>::=cow|pig|horse|sheep|chicken
<WILD_AN>::=lion|giraffe|tiger|camel|ostrich|rhinoceros
<ANIMAL>::=<DOM_AN>|<WILD_AN>|<PET>
<HUMAN>::=farmer|girl|policeman|hunter|man|boy
<A>::=<HUMAN>|<ANIMAL>
<CUT>::=cut|slice|snip|slash
<CUTTER>::=knife|scissor|sword|dagger
<BEE>::=bee|wasp|horse-fly
<HURT>::=bite|frighten|scare|kick|eat
END
<ANIMAL> <HURT> <HUMAN>
<CUTTER> <CUT> <A>
<A> break <CUTTER>
water drown <A>
<A> drink water
fire burn <A>
smoke suffocate <A>
<BEE> sting <A>
<A> swat <BEE>
wind blow-out fire
wind disperse smoke
smoke pollute wind
smoke smother fire
<HUMAN> disperse smoke
<A> spill liquor
liquor intoxicate <A>
<HUMAN> slay <WILD_AN>
<WILD_AN> eat <HUMAN>
END


Names referenced     Name         Type        Where defined
by RSTORY:           RSENTENCE    Function    Program 16.8

Epilogue


One example of a story produced by the program (untouched by 
human hands) is: 


     Long ago in a small village there was a little old
     lady who went to a pet store and bought a cat. On
     the way home they came upon a ditch which the cat was
     afraid to cross. The little old lady said "cat, cat,
     jump over the ditch or I won't get home tonight."
     But the cat would not. The little old lady went over
     hill and dale and she met a water. She said, "water,
     water, drown cat, cat won't jump over the ditch and
     I shan't get home tonight." But the water would not.
     The little old lady went on the road to the red school
     and she met a man. She said, "man, man, drink water,
     water won't drown the cat, cat won't jump over the
     ditch and I shan't get home tonight." But the man
     would not. The little old lady went toward the blue
     church and she met a lion. She said, "lion, lion,
     eat man, man won't drink the water, water won't drown
     the cat, cat won't jump over the ditch and I shan't
     get home tonight." But the lion would not. The little
     old lady went toward the yellow rock and she met a
     smoke. She said, "smoke, smoke, suffocate lion, lion
     won't eat the man, man won't drink the water, water
     won't drown the cat, cat won't jump over the ditch
     and I shan't get home tonight." But the smoke would
     not. The little old lady went toward the blue house
     and she met a girl. She said, "girl, girl, disperse
     smoke, smoke won't suffocate the lion, lion won't
     eat the man, man won't drink the water, water won't
     drown the cat, cat won't jump over the ditch and
     I shan't get home tonight." The girl knew the little
     old lady and, in fact, had been saved by her from a
     wild ostrich. So the girl began to disperse the smoke;
     the smoke began to suffocate the lion; the lion began
     to eat the man; the man began to drink the water;
     the water began to drown the cat; the cat began
     to jump over the ditch and the little old lady got
     home that night.


The reader will note that the story tends to be repetitious 
which is somewhat the point since small tots have a penchant 
for this sort of thing. 
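
The repetition comes from one growing string: each refusal is
added at the front of the chain of refusals, and at the end two
global replacements turn the refusals into compliances. The
Python sketch below imitates only that bookkeeping. It assumes
that the chain carries a leading blank and that entries are
joined by a comma; the names add_request and resolve are
invented for the illustration.

```python
def add_request(chain, subj, verb, last):
    """Put "<subj> won't <verb> the <last>," in front of the chain."""
    return " " + subj + " won't " + verb + " the " + last + "," + chain


def resolve(chain):
    """Turn every refusal into a compliance, as the story's ending does."""
    return chain.replace("won't", "began to").replace(",", "; the")


if __name__ == "__main__":
    chain = " cat won't jump over the ditch"
    chain = add_request(chain, "water", "drown", "cat")
    chain = add_request(chain, "man", "drink", "water")
    print("So the" + resolve(chain))
```

Because the comma is always followed by the blank that led the
older chain, replacing it by "; the" yields "the water; the
water began to ..." without further adjustment.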


In order to extend the robustness of the given program (where
robustness is defined as the degree to which the stories vary)
one may, of course, extend the vocabulary. One of the
limitations so encountered is the necessity within English to
observe certain grammatical niceties such as using 'she' to
refer to a woman. This single fact, incidentally, is the reason
that the principal character in the story has feminine gender.
To include any gender, one would at least need a function
PRONOUN(W) which will return the third person singular
personal pronoun for any word given as argument. While this
task is not formidable (with a limited vocabulary) a complete
set of grammatical transformations which would include, for
example, present tense to past and future, active voice to
passive, indicative mood to subjunctive, singular to plural,
represents a considerable undertaking. Thus, with story
generation, as opposed to mere sentence generation, we come to
grips with much more severe syntactic problems.
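
For a limited vocabulary a PRONOUN function amounts to a table
lookup. The Python sketch below is one hypothetical way to do
it, not part of RSTORY: the word lists are assumptions made for
the illustration, and anything not listed falls back to 'it'.

```python
# Hypothetical word lists for a small vocabulary (an assumption,
# not data taken from the program).
FEMININE = {"woman", "lady", "grandmother", "aunt", "girl", "seamstress"}
MASCULINE = {"man", "boy", "policeman"}


def pronoun(w):
    """Return the third person singular pronoun for the phrase w."""
    for word in w.lower().split():
        if word in FEMININE:
            return "she"
        if word in MASCULINE:
            return "he"
    # Animals, objects and unknown words are treated as neuter.
    return "it"
```

A phrase such as "little girl dressed in red" is handled by
scanning each of its words in turn, so the descriptive words
around the noun do no harm.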


The semantic difficulties involved in considerably extending
the robustness of the story generator are also of interest.
It should be clear that the vocabulary section of RSTORY can
be completely overhauled to produce stories in such diverse
settings as the wild west, interplanetary travel, the Jurassic
period (dinosaur days), etc. A weakness of the system is that
one could not place the union of all such information into the
story since, for example, the <EXCURSION> variable might
produce "the cowboy drove his spaceship past the red
pterodactyl." We should want to at least draw actors and
actions into the story on a logical, though perhaps
probabilistic, basis. The problem seems somewhat similar to
the Analogy Problem [Tuggle 1973] in which a program attempts
to fill in the blank in a sentence of the form

     A is to B as C is to __

Here, a sufficiently rich data base makes such problems
tractable. Returning to our story, if CHAR is our principal
character and we wish her (him) to travel we may say:

     "cowboy is to horse as CHAR is to __"

in order to find an appropriate means of transport. We can
see a bit of this in the specialized data section of RSTORY
(the second set of data) which sets forth relations between
individuals and specialized groups to obtain greater realism
at the expense of robustness. These relations are, of course,
all of a certain kind, viz. of the form <agent> <affects>
<agent>. Increasing the kinds of relations is essentially what
is required to solve the Analogy Problem. Thus, RSTORY may be
augmented by the possibility of having one or more of the
chain of agents wander off (after having been lined up) in a
manner consistent with the agent (water might evaporate, fire
burn out, lion be distracted by game, etc). This would add
another dimension to the story.
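
With relations stored by kind, filling the blank becomes a
lookup. The Python sketch below is a toy illustration and not
the method of Tuggle [1973]: the relation 'rides', the entries
in RELATIONS and the function name analogy are all hypothetical.

```python
# Hypothetical data base: (kind of relation, subject) -> object.
RELATIONS = {
    ("rides", "cowboy"): "horse",
    ("rides", "astronaut"): "spaceship",
    ("rides", "little old lady"): "bicycle",
}


def analogy(a, b, c):
    """Solve "a is to b as c is to __", or return None if no relation fits."""
    for (kind, subject), obj in RELATIONS.items():
        # Find the kind of relation that links a to b ...
        if subject == a and obj == b and (kind, c) in RELATIONS:
            # ... and apply the same kind of relation to c.
            return RELATIONS[(kind, c)]
    return None
```

Here analogy("cowboy", "horse", "astronaut") finds that a cowboy
rides a horse and so answers with what the astronaut rides. A
cowboy would be kept out of the spaceship simply because no
such relation is on file.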


On a deeper level, one may wonder whether it is possible for
the computer to play a greater role in the formation of the
plot and deciding on the 'point' of the story. Would
computer-generated stories always remain in the entertainment
category or could they serve some useful function such as
describing some complex event within, say, an operating system?
The question of randomly generated stories is currently a topic
of considerable interest. See AI FORUM [1974] for a vigorous
discussion and several other references. Also Knuth [Vol. 2]
describes a random western which was used as the basis for a
television film.

Exercise 16.1
RANDOM(0) has a distribution which is uniform over the
interval (0,1). It is sometimes required to have other kinds of
distributions. Define the distribution function (sometimes
called the cumulative distribution function) D(X) of a random
number generator R() as the function

     D(X) = Prob{ R() < X }

For example, the distribution function associated with the
uniform distribution slopes between 0 and 1 in the range (0,1)
and is 0 below and 1 above this range. Given an arbitrary
distribution function D(), write the random generator R() in
terms of the uniform generator RANDOM() and the inverse of
D(), call it ID(), which is presumed to exist.


Exercise 16.2
Suppose that a program requires random numbers between 0 and 1
in such a way that x is x/y times as likely to occur as y. Thus
1/2 is twice as likely to occur as 1/4. Write the distribution
function D() for the generator. Write a program to produce the
random numbers (functions in the ARITHMETIC chapter can be
used).


Exercise 16.3
Let a deck of cards be represented by 52 separate characters,
say:

     DECK = 'ab ... zAB ... Z'

In one statement, deal out four 5-card poker hands to players
P1, P2, P3 and P4. (Any function(s) in this chapter may be
used.)


Exercise 16.4
A well-known game is to find, for a given telephone number, a
sequence of letters which (1) when dialed will produce the same
number and (2) are a pronounceable sequence. For example,
233-6874 can perhaps more easily be remembered as 'BEDMUSH' or
'ADDNURI'. The correspondence is:

     2  ABC          6  MNO
     3  DEF          7  PRS
     4  GHI          8  TUV
     5  JKL          9  WXY

(1's and 0's create problems).

Write a function RPHONE to accept a telephone number and
return a random sequence of letters associated in the above
sense with the number. The sequence should bear some
similarity to English; to do this, use RCHAR for probable next
characters.


Exercise 16.5
What single statement can be modified so that RSELECT (Prog.
16.7) saves space rather than time?


Exercise 16.6
Augment the assignment interpreter in RSENTENCE so that the
variable assigned into need not also be the name of the
syntactic variable expanded. One way to do this is to let

     =var/s=

be interpreted as:

     var = RSENTENCE(s)


Exercise 16.7
If the argument to RSENTENCE is not well formed, the function
can loop. Give an example of a string which will have this
effect. What modification to RSENTENCE can correct this?
(Requires the addition of six characters and a blank.)


Exercise 16.8
This exercise is based on a suggestion by Yngve [1962]. In the
input to RSENTENCE let /text/ indicate that the result of
evaluating text (via RSENTENCE(text)) is to be placed in the
stack after the next item. An item is defined as either a
syntactic unit or a sequence of non-blanks. Thus

     <SENT>::=<NOUN> <VERB-PHRASE> <NOUN>
     <VERB-PHRASE>::=<VERB>/ <ADVERB>/

can result in " He called her up". Incorporate Yngve's
suggestion into RSENTENCE.


Exercise 16.9
In RSENTENCE, there are several characters which can't be used
directly within alternatives because they have some
meta-meaning (such as <>| etc.). Define an 'escape' convention
so that any special character can be incorporated in the final
text. Implement your scheme. (Hint: this can be implemented by
modifying one pattern.)


Exercise 16.10
For which of the following definitions will <S> have a
probability of looping greater than 0?

     (a) <S>::=A|<S>A|<S><S>A
     (b) <S>::=A|<S>A|<S><S>A|<S><S><S>A
     (c) <S>::=A|<T><T>
         <T>::=B|<S>C


Exercise 16.11
What is the probability that

     <S>::=A|<S>A<S>B<S>

as input to RSENTENCE will halt?

Exercise 16.12
The 'one-arm bandits' of gambling fame (also known as slot
machines) have three windows in which one of 20 pictures can
appear as follows [Spencer 1968]:

     Symbol          | Wheel 1 | Wheel 2 | Wheel 3
     ----------------+---------+---------+--------
     Cherry (C)      |    4    |    6    |    0
     Orange (O)      |    5    |    4    |    7
     Bell (E)        |    4    |    6    |    5
     Lemon (L)       |    3    |    2    |    4
     Watermelon (W)  |    3    |    1    |    3
     Bar (B)         |    1    |    1    |    1

Payoffs are as follows:

     C - -      3          W W B     15
     C C -      5          O O O     18
     O O B      6          W W W     20
     E E O      8          B B B    200
     L L L     10

Identify the sample space. Determine the total input to the
machine and the total return if each item in the sample space
is hit once and only once. What percentage of total bets is
taken by the machine? Write a program to simulate the slot
machine (can be done in as few as 10 statements using SUBSTR
(Prog. 3.9) and RANDOM).


Exercise 16.13
(a) Write a program to compute the area under the curve
$Y = X^2$ on the interval [0,1) by Monte Carlo techniques. Print
out this area every 100 samples so that you can observe the
rate at which the answer converges to its correct value (1/3).
(Hint: this requires a total of three statements.) (b) Compute
the 95% confidence interval after N trials and compare this
figure with the experimental results.


Exercise 16.14
To speed up the previous exercise, DUPL and CODE can be used so
that the inner loop of three statements is reduced effectively
to one. How can this be done?


Exercise 16.15
Modify RSEASON (Prog. 16.10) so that with probability E a
batsman will advance to first by means of an error where
otherwise he would simply have made an out. All other runners
should advance one base.


Exercise 16.16
Write a program called RGAME which will behave like RSEASON
except that RSENTENCE is used to supply running commentary of
the events which transpire. Include names of players in the
input data. Make your game colorful. Don't have a player merely
make an out, have him hit a sharp drive to center which is
speared by the centerfielder.


qr TET ETT ETT VUELTA 
{| Exercise 16.17 | Sagasti and Page [1970] describe an effort 


t———— to program and actually stage a computer- 
generated dance routine. The stage is divided up into 13 areas 
roughly as shown in Figure 16.1 


Figure 16.1 
The decomposition of the stage to produce a random 
dance. 


A dancer is permitted to move from one circle to an adjacent 
one; for example, in Figure 16.1 a dancer at F can move to any 
of A, B, E, G, J, or K; of course, the dancer may also remain 
at the same position. Dancers may exit and enter at random 
times but only to or from what may be called terminal nodes. 
For the exercise, let E, J, K, L, M and I be the terminals. 
Also, no two dancers may occupy the same spot at the same 
time. 


Implement a program to produce a random dance with the ad- 
ditional constraint that there be left-right symmetry. That 
is, for example, if a dancer moves from A to B then another 
dancer must move from D to C. To allow movement into the cen- 
ter position, create a new position Y which is offstage cen- 
ter. If a dancer at K goes to G then the dancer at L must go 
to Y, etc. Also, permit dancers at G and Y to change places. 
Denote offstage left as position X and offstage right as posi- 
tion Z. The output of the program should be a list of instruc- 
tions for each of eight dancers. 


Be careful! Sagasti and Page describe their initial efforts as 
resulting in "pandemonium on stage" until a slower tempo was 
found. They also described one dancer as "mildly bitter" at 
being forced to leave early. 


Exercise 16.18   Change the story given by RSTORY to one 
involving a space motif. Use RWORD to provide strange-sounding 
names of people and planets. 


Games are artificial environments frequently abstracted from 
reality intended to amuse and/or exercise the cranium. The 
computer (and computer programmers) are quite proficient at 
simulating such abstractions, much more so than the reality 
backdrop, so that there has for a long time been a happy 
marriage between computers and game playing (frequently to the 
chagrin of management intent on putting the high-priced piece 
of equipment to better use than amusing its high-priced 
employees). As the cost of computation diminishes, however, the 
recreational or game-playing applications of digital computers 
may be expected to increase, and surely any survey of SNOBOL4 
applications would not be complete were it to ignore this area 
entirely. The computer is, after all, the ultimate game if not 
the ultimate player. 


We almost, but not quite, include under the heading of games 
attempts to make the computer behave (i.e. converse) like a 
human. Weizenbaum [1966] made a notable attempt in this direc- 
tion with his program ELIZA. ELIZA will converse with the user 
in a form characteristic of a script given to it as data. The 
most familiar and popular script makes ELIZA behave like a 
psychiatrist. Though ELIZA was originally written in Fortran, 
Duquet [1970] has written a 'dramatically shorter' version in 
SNOBOL4. In SNOBOL4, the program is actually smaller than the 
psychiatrist script (two pages versus four). While we do not 
include the program here, we note in passing that dialogue is 
a necessary aspect of most games and a snappy dialogue can add 
appeal to an otherwise not-too-exciting game. We will 
return to this issue later. 


For good or ill, many games have been programmed on the com- 
puter. At a nearby PDP-10 time-sharing computer there exist 
twenty-some games including Chess, Go, Black Jack, Go-Moku, 
Monopoly, Tick-tack-toe (two and three dimensions), Nim and 
games based on football, golf and Startrek, to mention only 
those names that are immediately recognizable. There are many 
other games which have been, or will be, written for a digital 
computer; see Spencer [1968], Ball [1962] and especially Ahl 
[1973]. 


A game may be concealed or open. In an open game, such as 
Chess or Checkers, all information concerning the state of the 
game is available to both players. In concealed games, such 
as in many card games or in penny matching, each player may 
have information unavailable to the other. This is clearly 
the case if one is holding cards unseen by one's opponent. 
With penny matching, the concealed information is the player's 
strategy. In a concealed game, the player must play in such a 
way as not to reveal his hidden information and therefore the 
techniques and analysis are quite different from the open 
game. 


In concealed games, there seems to be a problem involving 
player and computer credibility which does not exist with the 
open game. Consider the game of penny-matching in which both 
players choose a side of a penny; one player wins (the other 
player's penny) if there is a match; otherwise the other 
player wins. With a computer there is a problem. If the 
computer goes first, there is the possibility that the player 
will cheat. If the player goes first, he may suspect the 
machine of cheating. Hagelbarger [1956] built a penny-matching 
machine, called SEER, which 'solved' this problem by the human 
saying aloud his choice of head or tail and the machine 
(sensitive only to sound) would indicate its choice whereupon 
the player would tell the machine, by a push button, who won. 
The machine can't cheat under these circumstances but the 
human certainly can. A counter was wired up to accumulate 
total wins and losses for the machine. Though the machine won 
most of its games, the results are clouded by the fact that 
some players would deliberately lie to the machine to see how 
it would operate in stressful situations. 


One solution to the concealment problem lies in the use of a 
one-way cipher (see ONEWAY, Prog. 16.4). Recall that given 
the returned value of ONEWAY(S) it is impractical to compute 
the original S or, indeed, any S which would yield the same 
returned value. Hence the computer can choose a random string 
R (possibly based on the clock) and then call ONEWAY(R 'H') if 
it chooses a head or call ONEWAY(R 'T') if it chooses a tail. 
The computer prints the returned value. Then the player plays. 
The machine then reveals its move together with R. The player 
can check, if he cares to, whether the previously printed 
value corresponds to the given value of R. Spot-checking a 
machine for fraudulent behavior should, in this way, be fairly 
easy. 
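
The exchange just described can be sketched in a few lines. The sketch is in Python rather than SNOBOL4 and rests on assumptions that are not part of this book: SHA-256 from the Python standard library stands in for ONEWAY, and the names oneway, commit and verify are illustrative only.

```python
import hashlib
import secrets


def oneway(s):
    """Stand-in for ONEWAY (an assumption): the SHA-256 digest of s, in hex."""
    return hashlib.sha256(s.encode("utf-8")).hexdigest()


def commit(move):
    """Computer side: choose a random string R and publish ONEWAY(R + move)."""
    r = secrets.token_hex(16)          # the random string R
    return oneway(r + move), r         # (value printed now, R revealed later)


def verify(printed, r, move):
    """Player side: check the revealed R and move against the printed value."""
    return oneway(r + move) == printed


printed, r = commit("H")               # the computer chooses a head
# ... the player announces a choice; the computer then reveals r and "H"
print(verify(printed, r, "H"))         # an honest reveal checks out
print(verify(printed, r, "T"))         # a changed move does not
```

Because R is chosen afresh for every game, the printed value tells the player nothing about the move until R is revealed.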


A one-way cipher can also be used to make sure that a computer 
is giving you a fair deal. See Exercise 17.1. 


Decision Trees and Decision Graphs 


A decision tree exists, at least conceptually, for any 
discrete open game. The top node, or root of the tree, 
represents the decision node of the first player and has a 
branch descending down for each possible choice of the first 
player on his first move. Each such branch descends to a node 
representing the decision node of the second player, etc. An 
actual decision tree is produced for a simple version of the 
stone game (see Figure 17.1). 


Decision trees grow exponentially and hence tend to be large. 
A complete decision tree for the game of Tick-tack-toe is for- 
bidding enough. One for the game of Chess is so large as to 
be meaningless. For example, at 10 moves per play and for 70 
plays, the number of nodes in the tree exceeds the number of 
atoms in the earth. 


It is more convenient to think of an open game as a collection 
of states where each move carries the play to a different 
state. There are terminal states which end the game and in- 
dicate a winner for one of the players. If every different 
move sequence leads to a different state, then the decision 
tree is equivalent to the decision graph. But in many games, 
the number of different states is far fewer than the number of 
nodes in the decision tree and the problem becomes amenable 
with a graph even though it appears to be impossible with a 
tree. 


One of the appeals of the decision tree is that it leads 
conceptually to a solution by means of the minimax process. 
The first player (A) selects that node which will maximize the 
outcome for him assuming that the second player will respond 
with the move that will minimize the outcome for A assuming 
that the first player responds with the move ... , etc. This 
strategy may be carried over to the decision graph as follows. 
Label all terminal states as +1 if a victory for the first 
player, -1 if a loss and 0 if a tie. Find a state that is 
directed only to states already marked (initially, only to 
terminal states). If it is a move by A, mark it with the 
maximum of the values of all states reachable from it. If it 
is a move by player B, mark it with the least such value. 
Repeat until every state is marked; each state will thus be 
marked with the value of the state to player A (assuming both 
players play optimally). If, before all states are marked, 
there is no state which is directed only to states already 
marked, then the game is not well-formed as it contains loops 
(or, what is equivalent, infinite paths). 
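
The marking procedure can be sketched as follows. This is a minimal illustration in Python rather than SNOBOL4; the function name mark_states, the dictionary representation of the graph and the tiny example game are all assumptions made for the sketch.

```python
def mark_states(successors, terminal_value, mover):
    """Mark every state with its value to player A.

    successors     : maps each non-terminal state to the states reachable from it
    terminal_value : maps each terminal state to +1, 0 or -1 (value to A)
    mover          : maps each non-terminal state to 'A' or 'B'
    """
    value = dict(terminal_value)
    unmarked = set(successors)
    while unmarked:
        # States directed only to states that are already marked.
        ready = [s for s in unmarked
                 if all(t in value for t in successors[s])]
        if not ready:
            raise ValueError("game is not well-formed: it contains loops")
        for s in ready:
            outcomes = [value[t] for t in successors[s]]
            value[s] = max(outcomes) if mover[s] == "A" else min(outcomes)
            unmarked.remove(s)
    return value


# A hypothetical three-state game: A moves first, then B.
successors = {"start": ["b1", "b2"], "b1": ["win", "tie"], "b2": ["tie", "loss"]}
mover = {"start": "A", "b1": "B", "b2": "B"}
terminal_value = {"win": 1, "tie": 0, "loss": -1}
values = mark_states(successors, terminal_value, mover)
print(values["start"])
```

In the example B can hold A to a tie from b1 and can beat him from b2, so A's best first move leads to b1 and the game is worth 0 to A.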


It will clearly be impossible to present a large number of in- 
tricate game-playing programs in this section. One complete 
chess program could perhaps occupy the better part of this 
book. What we can do is present a few games illustrative of 
their type and also give some commonly useful functions. 


Program 17.1   PHRASE 

For many computer-game players it is necessary to provide a 
carrot and a stick; otherwise, they will simply lose interest 
and quit. For the carrot we will issue a random compliment 
and, for the stick, we will generate an insult. These are 
illustrated by the two functions PRAISE() and INSULT(). There 
is also a function to mark time called LETMESEE(). Using 
RSENTENCE (Prog. 16.8) the dialogue is always fresh and 
lively. 


          DEXP("PRAISE() = RSENTENCE('<PRAISE>')") 
          DEXP("INSULT() = RSENTENCE('<INSULT>')") 
          DEXP("LETMESEE() = RSENTENCE('<LETMESEE>')") 


Names_referenced Name Type Where defined 
by PHRASE: DEXP Function Program 14.1 
RSENTENCE Function Program 16.8 


The input for RSENTENCE is: 


<GOOD>::=excellent|wonderful|nice|careful|impeccable|shrewd|
clever|nifty|good|smart|skillful|cunning|witty|fine|
splendid|elegant|#5#very <GOOD>|bright|brainy|brilliant|sharp|
keen|nimble-witted|slick|sly|astute|penetrating
<LETMESEE>::=<THOUGHT>|<MUMBLE>|<MUMBLE> <THOUGHT>|<THOUGHT>
<MUMBLE>
<MUMBLE>::=Hmmm|Ahh|Well Well|Gosh|Gee|OK|Oh man|Let's see|
Wait a minute|Interesting|Wow|Wowee|Yipes|Zowee|Whoosh|
#5#<MUMBLE> <MUMBLE>|#6#<MUMBLE>...
<THOUGHT>::=<LETME> <CONSIDER> <THIS>
<LETME>::=I think I'll|let me|I need time to|I'm going to
have to
<CONSIDER>::=consider|contemplate|mull over|#4#<THINK> about
<THINK>::=think|see|cogitate|meditate
<THIS>::=this|this one|the situation|this problem|this here
<P1>::=maneuver|strategem|tactic|play|move
<P2>::=performance|game|effort
<P3>::=play|strategy
<P13>::=<P1>s|<P3>
<P23>::=<P2>|<P3>
<P123>::=<P1>s|<P2>|<P3>
<PRAISE>::=<THANKS> for the game, <NICEGAME>
<THANKS>::=Thanks|Thank you|Thank you very much
<NICEGAME>::=I admired the <GOOD> <P123> on your part|
that was <GOOD> <P3> on your part|your <P1>s were quite
<GOOD>|it was a pleasure to play against one so <GOOD>|I
enjoyed your <GOOD> <P123>|I enjoyed particularly that last
<GOOD> <P1>
<STUPID>::=stupid|dumb|blundering|thick-headed|sad|
thick-skulled|silly|ludicrous|witless|poor|ponderous|
brainless|foolish|bungling|heavy-handed|graceless|clumsy
<FOOL>::=fool|dolt|idiot|oaf|blockhead|chump|ass|moron|ninny|
nincompoop|chump|dunce|bonehead|fathead|imbecile|jerk|baboon
<INSULT>::=You <STUPID> <FOOL>|I have never seen such <STUPID>
<P13>|Your <STUPID> <P23> befits a <STUPID> <FOOL>|
Your <STUPID> <P1>s indicate that you are a <STUPID>
<FOOL>|A <STUPID> <FOOL> is not so <STUPID> as you|
Your <P23> marks you as a <STUPID> <FOOL>|Your <P1>s are
less than <GOOD>
END 


Epilogue 


While random sentence generation has been around for quite 
some time, it generally comes in the form of a program which 
prints something. It is then neither obvious nor easy to har- 
ness the sentence generation for other than demonstrating the 
effect. It was for this reason that RSENTENCE was written as 
a function. 


Some sample phrases are: 


     "Thanks for the game, that was nice strategy on your part" 
     "You dumb idiot" 
     "Interesting Hmmm..." 
     "I'm going to have to consider this" 
     "I have never seen such thick-headed strategems" 
     "Thank you for the game, your plays were quite shrewd" 


It should be obvious which phrases were respectively returned 
by INSULT(), PRAISE() and LETMESEE(). 


Program 17.2   QUEST 

QUEST is intended to save some of the routine problems and 
house-keeping chores associated with a dialogue system. For 
example, all game routines will request numbers and/or strings 
from the player. The system must then check if these arguments 
are valid and, if not, indicate what is expected. If valid, 
the argument must be interpreted or assigned to a variable and 
an appropriate branch must be taken. Certainly, none of these 
chores is difficult to do, but it will be more convenient to 
combine them into one routine. For example, 


          QUEST('How much do you wish to bet?/BET(1...10)|(DROP)DR') 
+                                                    :S($LABEL) 


will print the message: 
How much do you wish to bet? 


(i.e. all characters up to the slash) and then either accept 
an integer in the range 1...10 and assign it to BET or accept 
the literal input DROP and transfer to label DR. The transfer 
is accomplished by having QUEST assign the string 'DR' to the 
global variable LABEL; if such an assignment is made, the 
RETURN exit is taken; otherwise the FRETURN exit is taken. In 
this way, the actual transfer takes place outside the function 
as shown. 


In general, the string following the slash is called the QUEST 
pattern and is a sequence of descriptors separated by bars. 
Each descriptor is of the form: 


variable (values) label 


The variable, if any, is assigned the value (if accepted) and 
the label is assigned as described above. Values may be of 
the form: 


number...number 


or some string constant, or the string ARB implying that any 
string of characters will be accepted. 


If the user types something that doesn't match, an error mes- 
sage (including a random insult) is given. Using the above 
example, the message (among other things) that will be typed 
is: 


     The correct form is: 1...10|DROP 


In general, the message will contain the QUEST pattern with 
labels, variables and parentheses stripped off. 


As a final bonus, if the user ever types question mark (?), a 
friendly reminder of the correct form is given. 
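
The interpretation of a QUEST pattern can be illustrated with a small sketch. It is written in Python rather than SNOBOL4, and the names quest_match and correct_form are hypothetical; it covers only literal integer bounds, string constants and ARB, and leaves out the evaluation of variable bounds, the printing of the message and the random insult.

```python
def quest_match(reply, pattern):
    """Return (variable, value, label) for the first descriptor that accepts
    reply, or None if no descriptor matches."""
    for descriptor in pattern.split("|"):
        name, _, rest = descriptor.partition("(")
        values, _, label = rest.partition(")")
        if values == "ARB":                    # any string is accepted
            return name, reply, label
        if "..." in values:                    # an integer range low...high
            low, high = values.split("...")
            try:
                number = int(reply)
            except ValueError:
                continue
            if int(low) <= number <= int(high):
                return name, number, label
        elif reply == values:                  # a string constant
            return name, reply, label
    return None


def correct_form(pattern):
    """Strip variables, labels and parentheses, leaving only the values."""
    return "|".join(d.partition("(")[2].partition(")")[0]
                    for d in pattern.split("|"))


pattern = "BET(1...10)|(DROP)DR"
print(quest_match("7", pattern))
print(quest_match("DROP", pattern))
print(quest_match("11", pattern))
print(correct_form(pattern))
```

An empty variable name means that nothing is assigned, and an empty label means that no transfer is requested, which corresponds to the FRETURN exit of QUEST.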


          DEFINE('QUEST(QS)QP,QPA,QN,QVP,QL,QLOW,QHI,QI') 

| First define a utility function QUESTP(QS,QP) which will   | 
| analyze the argument string QS according to the QUEST pat- | 
| tern given by QP. It will fail if no match is found.       | 

          DEFINE('QUESTP(QS,QP)QP1,QS1')             :(QUESTP_END) 

| Entry point: Break on an alternative and if one is found   | 
| call QUESTP recursively.                                   | 

QUESTP    QP  BREAK('|') . QP1  '|'  =               :F(QUESTP_1) 
          QUESTP(QS,QP1)                             :S(RETURN)F(QUESTP) 

| In QP we now have a single QUEST descriptor. Obtain the    | 
| variable name (QN), the label name (QL) and the value pat- | 
| tern (QVP).                                                | 

QUESTP_1  QP  BREAK('(') . QN  '('  =                :F(FRETURN) 
          QN  =  IDENT(QN)  'QDUMMY' 
          QP  BREAK(')') . QVP  ')'  REM . QL 

| If QS matches the value pattern, branch to QUESTP_3 for    | 
| the assignment. Convert QS if necessary to the proper      | 
| type.                                                      | 

          IDENT(QVP,'ARB')                           :S(QUESTP_3) 
          QVP  ARB . QLOW  '...'  REM . QHI          :S(QUESTP_2) 
          IDENT(QS,QVP)                              :S(QUESTP_3)F(FRETURN) 
QUESTP_2  QLOW  =  ¬INTEGER(QLOW)  EVAL(QLOW) 
          QHI  =  ¬INTEGER(QHI)  EVAL(QHI) 
          QS  =  CONVERT(QS,'INTEGER')               :F(FRETURN) 
          (LE(QLOW,QS)  LE(QS,QHI))                  :F(FRETURN) 
QUESTP_3  $QN  =  QS 
          LABEL  =  DIFFER(QL)  QL                   :(RETURN) 
QUESTP_END 

| Define a pattern (QUEST.QPA) which will extract from a     | 
| QUEST descriptor, the inner QUEST pattern. ID.V will match | 
| an identifier assigning it to V.                           | 

          NEUT  =  BREAK('|()') 
          QUEST.QPA  =  NEUT  '('  NEUT . QPA  ')'  (NEUT | REM) 
          A  =  'ABCDEFGHIJKLMNOPQRSTUVWXYZ' 
          ID.V  =  (ANY(A)  (SPAN(A '0123456789_.') | ''))  .  V 
                                                     :(QUEST_END) 

| Entry point: After printing the message, interpret the     | 
| input. Errors are processed at QUEST_3.                    | 

QUEST     LABEL  = 
          QS  BREAK('/') . OUTPUT  '/'  REM . QP 
          QI  =  TRIM(INPUT)  ;  OUTPUT  =  QI 
QUEST_1   QP  ID.V  '...'  =  EVAL(V)  '...'         :S(QUEST_1) 
QUEST_2   QP  '...'  ID.V  =  '...'  EVAL(V)         :S(QUEST_2) 
          (DIFFER(QI,'?')  QUESTP(QI,QP))            :F(QUEST_3) 
          DIFFER(LABEL)                              :S(RETURN)F(FRETURN) 

| Extract and print the pattern and also indicate our        | 
| feelings.                                                  | 

QUEST_3   QP  QUEST.QPA  =  QPA                      :S(QUEST_3) 
          OUTPUT  =  DIFFER(QI,'?') 
+            RSENTENCE('Bad input, you <STUPID> <FOOL>') 
          OUTPUT  =  'The correct form is '  QP      :(QUEST) 
QUEST_END 


Names referenced Name Type Where defined 


by QUEST: STUPID Syntactic Variable Program 17.1 
FOOL Syntactic Variable Program 17.1 


Program 17.3   STONE 

Let there be N stones in a pile (where N is odd) and let each 
player take, on each move, either 1, 2, ... , or K stones from 
the pile. When the pile is exhausted, the player with an odd 
number of stones wins. For example, if N=5 and K=2 we have a 
very simple game for which we can portray a complete decision 
tree as shown in Figure 17.1. 


By applying the previously described minimax procedure (or by 
using common sense) the tree indicates a victory for the first 
player, A. If the rules of the game are changed to make the 
winner the one with even parity, the game is victory for B, no 
matter what A does on the first move. 


The decision tree algorithm can be employed if the tree is 
sufficiently small but becomes quite impractical as soon as 
the game becomes nontrivial. To see this, let us fix K=2 and 
let N vary. The number of branches, E(N), in the tree is given 
by the formula: 

     E(N) = 2 + E(N - 1) + E(N - 2) 

which is immediately evident from the figure. While it may be 
an interesting exercise to solve this recurrence relation our 
purpose is served by simply noting that: 

     E(N) > 2 * E(N - 2) 

so that 

     E(N) > 2 ** (N/2) 

which implies that E(N) is exponential. 


Figure 17.1 

The decision tree for the stone game with N=5 and 
K=2. Player A goes first. At each node, three 
numbers indicate the number of stones left in the 
pot, the number of stones in A's possession and 
the number of stones in B's possession. Parens 
indicate a decision node for A, brackets indicate 
a decision node for B. 
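
The growth of E(N) can be checked numerically. The sketch below is in Python rather than SNOBOL4; the function name branches and the base cases E(0) = 0 and E(1) = 1 (an empty pile offers no move, a pile of one offers exactly one) are assumptions of the sketch, since the text gives only the recurrence.

```python
def branches(n):
    """E(n) for K = 2, assuming the base cases E(0) = 0 and E(1) = 1."""
    e = [0, 1]
    for k in range(2, n + 1):
        e.append(2 + e[k - 1] + e[k - 2])
    return e[n]


# Compare E(N) with the lower bound 2 ** (N/2) for a few pile sizes.
for n in range(2, 31, 4):
    print(n, branches(n), 2 ** (n / 2))
```

Under these base cases the bound holds for every N from 2 upward, and E(N) in fact grows considerably faster than the bound.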


The decision graph on the other hand is quite well-behaved 
especially if we combine all nodes with the same parities for 
the two players. That is, for a given number of stones in the 
pot, we can group all nodes together such that the player 
about to pick has an even parity. In this way the number of 
nodes is only 2N and the number of branches is bounded by 2NK. 
Figure 17.2 indicates (within the limits of our artistry) the 
decision graph for the stone game (with K=2 and N=5). 


From the decision graph it is an easy matter for a program to 
compute an optimal strategy for a game of any N and any K and 
for either victory parity. A 2 X N decision array is allocated 
which corresponds to the nodes of Figure 17.2. The rest is a 
simple matter of using the QUEST routine. 


Figure 17.2 

A decision graph for the stone game with K=2 and 
N=5. The nodes on the left are associated with 
odd parity and those on the right with even 
parity. Parity refers to the parity of the player 
about to move. 
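
Before turning to the SNOBOL4 listing, the filling of the decision array can be sketched in Python. The function name stone_decision_array and the list-of-lists representation are assumptions of the sketch; the logic follows the SDA function below: an entry is the number of stones to take, 'L' if every move loses, and 'W' for the winning parity once the pile is empty.

```python
def stone_decision_array(nstones, parity, max_take):
    """table[i][p] is the move for the player about to pick when i stones are
    left and that player holds a number of stones of parity p."""
    table = [["L", "L"] for _ in range(nstones + 1)]
    table[0][parity] = "W"
    for i in range(1, nstones + 1):
        for p in (0, 1):
            # Parity of the opponent's holdings: stones taken so far less ours.
            opponent_parity = (nstones - i - p) % 2
            for j in range(1, min(max_take, i) + 1):
                if table[i - j][opponent_parity] == "L":
                    table[i][p] = j        # taking j leaves the opponent lost
                    break
    return table


odd_wins = stone_decision_array(5, 1, 2)
even_wins = stone_decision_array(5, 0, 2)
print(odd_wins[5][0], even_wins[5][0])
```

For N=5 and K=2 the first player, who starts with no stones (even parity), has a winning move when odd parity wins and none when even parity wins, in agreement with the discussion of Figure 17.1.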


| The function SDA(NSTONES,PARITY,MAX) will create a Deci-   | 
| sion Array for the Stone game for a given number of stones | 
| (NSTONES). PARITY (0 or 1) indicates which parity wins     | 
| and MAX indicates the maximum number of stones that may be | 
| taken per step.                                            | 

          DEFINE('SDA(NSTONES,PARITY,MAX)A,I,OPAR,P,J') 
                                                     :(SDA_END) 

| Allocate and initialize the array (SDA). SDA<N,P> in-      | 
| dicates what to do if there are N stones left and you've   | 
| got parity P. If there is no right decision, an 'L' for    | 
| lose is given.                                             | 

SDA       SDA  =  ARRAY('0:' NSTONES ',0:1' , 'L') 
          SDA<0,PARITY>  =  'W' 

| For each stone (I) and for each parity (P), determine the  | 
| strategy by finding which move (J) will end in a losing    | 
| situation for the opponent.                                | 

SDA_1     I  =  I + 1   LT(I,NSTONES)                :F(RETURN) 
          P  =  -1 
SDA_2     P  =  P + 1   LT(P,1)                      :F(SDA_1) 
          OPAR  =  REMDR(NSTONES - I - P, 2) 
          J  =  0 
SDA_3     J  =  J + 1   LT(J,MAX)                    :F(SDA_2) 
          IDENT(SDA<I - J,OPAR>, 'L')                :F(SDA_3) 
          SDA<I,P>  =  J                             :(SDA_2) 
SDA_END 


| Main routine: The rules of the game follow the END label   | 
| and are optionally printed (no sense boring the expert, he | 
| may be you). The rest of the program should be self-       | 
| evident and will be given without further comment.         | 

          QUEST('Do you want the rules?/(NO)NEWG|(YES)')  :S($LABEL) 
STONE_1   OUTPUT  =  INPUT                           :S(STONE_1) 
NEWG      QUEST('No. of stones (odd) = /NSTONES(1...1000)') 
          EQ(REMDR(NSTONES,2),0)                     :S(NEWG) 
          QUEST("Winner's Parity (0...1) = /P(0...1)") 
          QUEST("Maximum Take = /MAX(2...1000)") 
OLDG      NS  =  NSTONES 
          MAXA  =  MAX 
          A  =  SDA(NS,P,MAX) 
          HIM  =  0 
          ME  =  0 
HIS_TURN 
          OUTPUT  =  'There are '  NS  ' stones in the pile.' 
          MAXA  =  GT(MAXA,NS)  NS 
          QUEST('How many do you want? /K(1...MAXA)') 
          NS  =  NS - K  ;  HIM  =  HIM + K 
          EQ(NS,0)                                   :S(TOTALIZE) 
MY_TURN 
          K  =  A<NS,REMDR(ME,2)> 
          K  =  IDENT(K,'L')  1 
          NS  =  NS - K 
          ME  =  ME + K 
          OUTPUT  =  LETMESEE() 
          S  =  K  ' stones.' 
          S  =  EQ(K,1)  'just one.' 
          OUTPUT  =  "I think I'll take "  S 
          EQ(NS,0)                                   :F(HIS_TURN) 
TOTALIZE 
          OUTPUT  =  'You have ' HIM ' stones and I have ' ME ' stones' 
          EQ(REMDR(HIM,2),P)                         :S(HE_WINS) 
          OUTPUT  =  'That means I win' 
          OUTPUT  =  INSULT()                        :(CHANGE) 
HE_WINS 
          OUTPUT  =  'That means you win' 
          OUTPUT  =  PRAISE() 
CHANGE 
          QUEST('Would you like to change the game? /' 
+            '(YES)NEWG|(NO)OLDG')                   :($LABEL) 
END 


Names referenced   Name     Type       Where defined 
by STONE:          QUEST    Function   Program 17.2 
                   PHRASE   Package    Program 17.1 


Epilogue 


It is necessary to be as complete as possible in the proces- 
sing of input information when the user of the system is 
someone other than the person who wrote the program. This is 
especially true here where presumably the user is the playful 
sort anyway. This was the reason for the creation of the 
variable MAXA whose purpose is to limit the value of the 
selection to the minimum of the stated limit and the number 
of stones left in the pile. 


An example of a typical session with the STONE game is shown 
below. In the printed listing, underlined sections indicate 
the machine's responses. 


     Do you want the rules? NO 
     No. of stones (odd) = 13 
     Winner's Parity (0...1) = 0 
     Maximum Take = 3 

     There are 13 stones in the pile. 
     How many do you want? 3 

     There are 8 stones in the pile. 
     How many do you want? 1 

     There are 4 stones in the pile. 
     How many do you want? 3 
     Ahh... Wow 

     You have 7 stones and I have 6 stones 
     That means I win 
     You thick-skulled moron 
     Would you like to change the game? 1 
     Bad input, you brainless ninny 


Program 17.4   TICTACTOE 

The reader is presumed familiar with the game of Tick-tack-toe 
whose popularity is itself a puzzle since it is hard to do 
anything but tie. Nonetheless, it is interesting enough to 
illustrate several game-playing techniques. 


A complete decision tree for the game has nine possible 
choices for the first move, eight for the second, seven for 
the third, etc. Hence there are 9! (= 362,880) branches in 
the decision tree. Using SNOBOL4 and spending 10 milliseconds 
on each branch, one must spend about 60 minutes of machine 
time to analyze the game, which is a bit much. When one 
considers the decision graph, however, there are only 
3**9 = 19,683 possible boards and not every board is reachable 
by the rules of the game. Thus, there is a great deal of 
folding back. 


The pure tree-searching algorithm is actually quite simple 
since one need only know how to make a move and how to detect 
victory. That is, assume we write a routine, TTTV, to deter- 
mine the value of a board to, say, Player X (i.e. the one who 
marks X's in squares as opposed to O's) and another routine 
TTTM, which determines an optimal move for X. An arbitrary 
board is given to TTTV which first tests whether a winning 
combination exists. If so, the value of the board is self- 
evident. If not, it asks TTTM for the best move for player X. 
Upon getting it, TTTV evaluates the board from the point of 
view of player O. It does this by interchanging O's and X's 
and calling itself recursively. It then returns the negative 
of the number so returned. The coding of TTTM is even simpler. 
TTTM simply tries each move and asks TTTV to evaluate it (from 
the standpoint of player O). This is not super efficient but 
it works. 


An algorithm based on the decision graph, on the other hand, 
may at first sight appear to be much more complicated re- 
quiring a complete graph description of the game. But we can 
let the computer do most of our graph-building as follows. 
Record each new state (new board position) that we come to in 
a table allocated for that purpose, and record with the table 
the move made. At each new situation, the table is consulted 
to see whether we've been there before. 
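
The idea of evaluating a board by swapping the marks and remembering every board seen can be sketched compactly. The sketch is in Python rather than SNOBOL4, merges the roles of TTTV and TTTM into a single hypothetical function value, keeps the table in an ordinary dictionary, and leaves out the symmetry handling discussed below.

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),      # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),      # columns
         (0, 4, 8), (2, 4, 6)]                 # diagonals
SWAP = str.maketrans("XO", "OX")

table = {}                                     # boards already evaluated


def lost(board):
    """True if O, who has just moved, holds three in a row."""
    return any(all(board[i] == "O" for i in line) for line in LINES)


def value(board):
    """Value of a 9-character board to X, who is about to move:
    +1 for a win, 0 for a tie, -1 for a loss."""
    if lost(board):
        return -1
    if board in table:                         # been here before
        return table[board]
    best = None
    for i, mark in enumerate(board):
        if mark == " ":
            after = board[:i] + "X" + board[i + 1:]
            v = -value(after.translate(SWAP))  # the opponent's view, negated
            best = v if best is None else max(best, v)
    table[board] = 0 if best is None else best # a full board is a tie
    return table[board]


print(value(" " * 9))
print(len(table))
```

The size of the table after the first call shows how few distinct boards are actually visited compared with the 9! branches of the tree.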


While these techniques are suitable for Tick-tack-toe, the 
search times become impractical for more complicated open 
games such as Chess and Checkers. To a first approximation, 
these games can be played with a truncated decision tree which 
means that the tree is searched to a limited depth and only a 
limited number of alternative moves at each level are 
considered. Samuel [1963] describes a Checker-playing program 
which also stores boards as in the decision graph algorithm. 
This permits the program to learn as it continues to play. 
Note that storing a particular state helps not only when 
returning to that state but in resolving the value of all 
states which can reach the remembered state. In the game of 
Checkers the number of states that need be remembered can be 
reduced by considering all symmetries of a given board 
position. This is fully illustrated with the game of 
Tick-tack-toe. Thus if the proper response to one board is 
remembered to be a second board, 


     [two board diagrams: a position containing two O's and an 
     X, and the same position after X's reply] 


then we should not have to recompute if 


     [board diagram: a rotated or reflected copy of the first 
     position] 


is encountered. 


Assume that boards are represented as strings, so for example 
the last board above is represented as: 


' OXO ' 


We can permute such a string very efficiently using positional 
transformations. But how many symmetries are there? Figure 
17.3 below illustrates the eight symmetries of the two- 
dimensional Tick-tack-toe board. 


     [eight board diagrams, four in the upper row and four in 
     the lower row] 

Figure 17.3 


The eight symmetries of the Tick-tack-toe board. 


A method for producing these symmetries is found by noting 
that the upper four are 90° clockwise rotations of each other 
as are the bottom four. The first of the bottom group is found 
by flipping one of the top group completely over so that we 
are looking at its underside. Thus, with two basic permuta- 
tions we are able, with the help of a little counting, to 
produce all eight. 
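
The two basic permutations can be sketched as follows. The sketch is in Python rather than SNOBOL4; the order strings are those read from the NEXTBD listing below, and the names permute and symmetries are assumptions of the sketch.

```python
ROTATE = "741852963"   # new square k takes its mark from old square ROTATE[k]
FLIP = "321654987"     # left-right mirror image


def permute(board, order):
    """Rearrange a 9-character board string according to a 1-based order string."""
    return "".join(board[int(c) - 1] for c in order)


def symmetries(board):
    """Return eight boards: four rotations, then four rotations of the flip."""
    result = []
    for _ in range(2):
        for _ in range(4):
            board = permute(board, ROTATE)
            result.append(board)
        board = permute(board, FLIP)
    return result


for b in symmetries("OX       "):
    print(repr(b))
```

A board with no symmetry of its own, such as the one in the example, yields eight different strings; a board with only the center marked yields the same string eight times.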


It is not always easy to determine the number of symmetries 
for some arbitrary board game. A method that may prove helpful 
is to consider the number of equivalent serializations of the 
points of the board. For example, we can serialize the points 
of Tick-tack-toe in the order indicated in the diagram below: 


      1 | 2 | 3 
     ---+---+--- 
      4 | 5 | 6 
     ---+---+--- 
      7 | 8 | 9 

An equivalent  serialization would require that we begin at 
some corner (there are 4) and that we proceed along some edge 
(given the corner, there are 2 possibilities) and sweep the 
square one line at a time until all points have been touched. 
There are therefore 8 in all. 


Whereas before we could count approximately 20,000 different 
Tick-tack-toe boards, there are far fewer if we take into ac- 
count  symmetries. Unfortunately, if we wanted to determine 
exactly how many we could not simply divide 20,000 by 8 to ob- 
tain 2,500 as this would not allow for the fact that some 
boards rotate or flip into themselves. Though 2500 is a good 
lower bound, to find the exact number one must use Polya's 
theory of counting. See for example Harrison [1965]. We will 
be content with letting the program do the counting. 
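
As a worked check, the count over all 3**9 arrangements of blank, X and O (whether or not they can arise in play) follows from Burnside's lemma, the simplest case of Polya's theory: average, over the eight symmetries, the number of boards each symmetry leaves unchanged. A symmetry whose permutation of the nine squares has c cycles leaves 3**c boards unchanged. The cycle counts below are our own tabulation, not taken from the references.

```latex
\begin{aligned}
\text{identity } (9 \text{ cycles}): &\quad 3^{9} = 19683 \\
\text{rotations by } 90^\circ \text{ and } 270^\circ\ (3 \text{ cycles each}): &\quad 2 \cdot 3^{3} = 54 \\
\text{rotation by } 180^\circ\ (5 \text{ cycles}): &\quad 3^{5} = 243 \\
\text{four reflections } (6 \text{ cycles each}): &\quad 4 \cdot 3^{6} = 2916 \\
\text{distinct boards}: &\quad \frac{19683 + 54 + 243 + 2916}{8} = \frac{22896}{8} = 2862
\end{aligned}
```

The result lies above the simple quotient 19,683/8, as expected, since boards that rotate or flip into themselves have fewer than eight distinct images.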


In what follows we will define the functions TTTV and TTTM for 
the game of Tick-tack-toe. Given these functions, it should 
be an easy matter to write a complete program to play the game 
with a human opponent. Also, the program will play other games 
on the 3X3 board by simply changing the definition of losing 
pattern (LOS_PAT). It will play other O-X games on different 
size boards by changing the definition of equivalent board 
(the function NEXTBD) as well as LOS_PAT. These are left as 
exercises. 


TTTM remembers board positions by storing them in the table 
TTT. This table can be initialized with boards which block 
opponent victory (increasing efficiency) or with boards 
indicating heuristic plays or standard openings. These options, 
too, are explored in the exercises. 


| We first define a utility routine which cycles through all | 
| the boards equivalent to a given Tic-tac-toe board. It     | 
| expects as argument the last board returned. NEXTBD can    | 
| always be initialized by setting NEXT_N to 0.              | 

          DEFINE('NEXTBD(B)')                        :(NEXTBD_END) 

| Entry point: The first REPLACE is a clockwise rotation     | 
| (done each time). The second REPLACE is a flip (done every | 
| four times).                                               | 

NEXTBD    NEXT_N  =  EQ(NEXT_N,8)                    :S(FRETURN) 
          NEXT_N  =  NEXT_N + 1 
          NEXTBD  =  REPLACE('741852963','123456789',B) 
          NEXTBD  =  EQ(REMDR(NEXT_N,4)) 
+            REPLACE('321654987','123456789',B) 
                                                     :(RETURN) 
NEXTBD_END 


| TTTV(B) will determine the value of the board B to player  | 
| X given that it is his move. It is presumed that he does   | 
| not yet have a winning combination.                        | 

          DEFINE('TTTV(BOARD)') 
          LOS_PAT  =  POS(0) ('OOO' | 'O' LEN(3) 'O' LEN(3) 'O' 
+                    | LEN(3) 'OOO') 
                                                     :(TTTV_END) 
TTTV      NEXT_N  =  0 
          TTTV  =  -1 
TTTV_1 
          BOARD  =  NEXTBD(BOARD)                    :F(TTTV_2) 
          BOARD  LOS_PAT                             :S(RETURN)F(TTTV_1) 
TTTV_2 
          TTTV  =  0 
          TTTV  =  -TTTV(REPLACE(TTTM(BOARD),'XO','OX'))  :(RETURN) 
TTTV_END 
*  TTTM will find the best move that player X can make on the
*  given board. It first checks to determine whether it or
*  any board similar to it was processed before. Old boards
*  are kept in the table TTT. TTTM actually returns the new
*  game state.
*
         DEFINE('TTTM(BOARD)T,N,MAX,V')
         TTT  =  TABLE()
                                                      :(TTTM_END)
TTTM     NEXT_N  =  0
         MAX  =  -2
         BOARD  ' '                                   :F(FRETURN)
TTTM_1   BOARD  =  NEXTBD(BOARD)                      :F(TTTM_2)
         TTTM  =  TTT<BOARD>
         DIFFER(TTTM)                                 :S(RETURN)F(TTTM_1)
TTTM_2   BOARD  (TAB(N) ARB) . T  ' '  @N  =  T 'X'   :F(TTTM_4)
         V  =  -TTTV(REPLACE(BOARD,'OX','XO'))
         MAX  =  GT(V,MAX)  V                         :F(TTTM_3)
         TTTM  =  BOARD
TTTM_3   BOARD  POS(N - 1)  LEN(1)  =  ' '            :(TTTM_2)
TTTM_4   TTT<BOARD>  =  TTTM                          :(RETURN)
TTTM_END
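
The mutual recursion of TTTV and TTTM amounts to a negamax search in which the player to move is always called X and the marks are exchanged after each move. The following Python sketch illustrates that idea under stated assumptions: it is not a transcription of the program above, the names has_line, swap and value are hypothetical, and a memoizing cache stands in for the table TTT without any symmetry reduction.

```python
from functools import lru_cache

# The eight winning lines of a 3x3 board, cells numbered 0..8 row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

SWAP_MARKS = str.maketrans("XO", "OX")


def has_line(board, mark):
    """True if the given mark occupies a complete line."""
    return any(all(board[i] == mark for i in line) for line in LINES)


def swap(board):
    """Exchange X and O so that the next mover is again called X."""
    return board.translate(SWAP_MARKS)


@lru_cache(maxsize=None)   # plays the role of the table of old boards
def value(board):
    """Value (-1, 0 or +1) of the board to X, who is to move."""
    if has_line(board, "O"):
        return -1                      # the opponent has already won
    moves = [i for i, cell in enumerate(board) if cell == " "]
    if not moves:
        return 0                       # full board, nobody won
    # Try each move, view the result from the opponent's side, negate.
    return max(-value(swap(board[:i] + "X" + board[i + 1:])) for i in moves)


if __name__ == "__main__":
    print(value(" " * 9))
```

The empty board evaluates to a draw, which is the familiar result for Tic-tac-toe.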
Game Theory

In concealed games, we have the added complexity that our
strategy may tip off our opponent to our disadvantage. In any
of the varieties of the game of poker, for example, aggressive
betting may scare off an opponent who might otherwise stick and,
in this way, fail to seduce him into betting more of his funds
in a losing cause. It therefore pays to vary one's strategy and
either not always bet aggressively with a good hand or bet
aggressively with a bad hand occasionally (the so-called bluff).
Many people feel that behavior such as bluffing is incompatible
with machine play. But as we will see, machines can do very well
in a game such as poker and in fact can play truly optimal
strategies.


                        B

                    I       II
                +-------+-------+
             I  |   1   |  -2   |
        A       +-------+-------+
            II  |  -2   |   4   |
                +-------+-------+

                  Figure 17.4

          A two-person zero-sum game


Let us take a hypothetical situation shown in Figure 17.4. 
There are two players, A and B, each with two possible moves, 
I and II. Each selects a move (unbeknownst to the other) and 
the matrix indicates how much B should pay A for each of the 
four possible outcomes. If the amount indicated is negative 
then the transfer of funds is in the direction from A to B. 
The game is called zero-sum because whatever one player wins 
the other loses; a situation which does not always exist in 
real life when, for example, a nuclear holocaust could be 
disastrous for both sides. 


How should A play the game? If he tries for the big payoff of 
4 by always selecting move II, B will catch on eventually and 
begin playing move I exclusively. Then A, seeing that he is 
losing 2 on each turn will begin selecting move I until B cat- 
ches on to that. Clearly both sides must play a so-called 
mixed strategy wherein their selection of I and II is un- 
predictable. Neither player should base their move on a 
strictly deterministic basis as this strategy may be uncovered 
by the opponent and exploited. This conclusion is perhaps
intuitively implausible but one need only reflect on the
penny-matching game to see the importance of not developing
easily detectable patterns of play.
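
For a 2 by 2 game without a saddle point, the mixed strategy can be computed by making the opponent indifferent between his two moves. The following Python sketch (an illustration, not part of the original text; the function name solve_2x2 is hypothetical) applies that standard calculation to the matrix of Figure 17.4, taking the entries to be 1, -2, -2 and 4.

```python
from fractions import Fraction


def solve_2x2(a, b, c, d):
    """Solve a 2x2 zero-sum game with no saddle point.

    Payoffs are to the row player A: row I is (a, b), row II is (c, d).
    Returns (p, q, value) where p is the probability that A plays I,
    q is the probability that B plays I, and value is A's expectation.
    """
    denom = Fraction(a - b - c + d)
    p = (d - c) / denom            # makes B indifferent between I and II
    q = (d - b) / denom            # makes A indifferent between I and II
    value = (a * d - b * c) / denom
    return p, q, value


if __name__ == "__main__":
    p, q, value = solve_2x2(1, -2, -2, 4)
    print("A plays I with probability", p)
    print("B plays I with probability", q)
    print("value of the game to A:", value)
```

With these payoffs each player selects move I two times in three, and the game is fair: neither side can expect to gain against correct play.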


Program 17.5 CARDPAK

As a fairly complicated example of a game-theoretic approach,
we will present a program which will play an optimal game of
poker. Prior to presenting the game we will establish certain
utility functions which may be useful not only in other forms
of poker but perhaps in other card games as well.


An important initial consideration is the choice of data
representation. How should a card be represented? In SNOBOL4,
with its wealth of string operations, a natural choice is a
single character. We will represent the 52 cards of the deck
by the letters of the alphabet:

   'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'

The assumed ordering is:

   (2C 3C ... AC) (2D 3D ... AD) (2H 3H ... AH) (2S 3S ... AS)

In principle, any 52 characters could have been used such as
the first 52 characters of &ALPHABET. In practice, debugging
is easier if one uses printable characters.


         DEFINE('RHAND(K,FLAG)')
         DEFINE('SUITS(H)')
         DEFINE('VALS(H)')
         DEFINE('DISPLAY(H)VALS,SUITS,V,S')
*
*  Initialization of constant strings.
*
         FULL_DECK  =
+        'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
         ALL_VALS  =  'ABCDEFGHIJKLM'
         JUST_VALS  =  DUPL(ALL_VALS,4)
         JUST_SUITS  =  DUPL('C',13) DUPL('D',13) DUPL('H',13)
+                       DUPL('S',13)
                                                      :(CARDPAK_END)
*
*  RHAND(K,FLAG) will return a random hand with K cards in
*  it. If FLAG is nonnull, the deck will be reshuffled. If
*  an insufficient number of cards remain, RHAND will fail.
*
RHAND    RANDOM_DECK  =  DIFFER(FLAG)  RPERMUTE(FULL_DECK)
         RANDOM_DECK  LEN(K) . RHAND  =               :F(FRETURN)S(RETURN)
*
*  SUITS(H) will return just the suits for the hand H.
*
SUITS    SUITS  =  REPLACE(H,FULL_DECK,JUST_SUITS)    :(RETURN)
*
*  VALS(H) will return just the values of the hand H.
*
VALS     VALS  =  REPLACE(H,FULL_DECK,JUST_VALS)      :(RETURN)
*
*  DISPLAY(H) will return a string representing the hand H in
*  a form consistent with conventional representations.
*
DISPLAY  VALS  =  REPLACE(VALS(H),ALL_VALS,'23456789TJQKA')
         SUITS  =  SUITS(H)
DISPLAY_1
         VALS  LEN(1) . V  =                          :F(RETURN)
         V  =  IDENT(V,'T')  '10'
         SUITS  LEN(1) . S  =
         DISPLAY  =  DISPLAY  V  S  ' '               :(DISPLAY_1)
CARDPAK_END

Names referenced     Name       Type       Where defined
by CARDPAK:          RPERMUTE   Function   Program 16.3
                     ORDER      Function   Program 3.1
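
The one-character-per-card representation carries over directly to any language with a character translation primitive. The Python sketch below is an illustration only, using str.translate where CARDPAK uses REPLACE; the function names are hypothetical, and deal draws from a fresh deck each time rather than keeping a shuffled deck between calls as RHAND does.

```python
import random
import string

# Same layout as the CARDPAK listing: 13 clubs, then diamonds, hearts, spades.
FULL_DECK = string.ascii_lowercase + string.ascii_uppercase
ALL_VALS = "ABCDEFGHIJKLM"
JUST_VALS = ALL_VALS * 4
JUST_SUITS = "C" * 13 + "D" * 13 + "H" * 13 + "S" * 13

TO_SUITS = str.maketrans(FULL_DECK, JUST_SUITS)
TO_VALS = str.maketrans(FULL_DECK, JUST_VALS)
TO_FACES = str.maketrans(ALL_VALS, "23456789TJQKA")


def suits(hand):
    """Return just the suits of the hand, one character per card."""
    return hand.translate(TO_SUITS)


def vals(hand):
    """Return just the values of the hand, coded 'A' (deuce) to 'M' (ace)."""
    return hand.translate(TO_VALS)


def display(hand):
    """Return the hand in conventional form, e.g. '2C 10C AD'."""
    faces = vals(hand).translate(TO_FACES)
    cards = (("10" if f == "T" else f) + s for f, s in zip(faces, suits(hand)))
    return " ".join(cards)


def deal(k, rng=random):
    """Return k distinct random cards (a simplification of RHAND)."""
    return "".join(rng.sample(FULL_DECK, k))


if __name__ == "__main__":
    hand = deal(5)
    print(hand, "->", display(hand))
```

Unlike DISPLAY, this sketch joins the cards with single blanks and leaves no trailing blank.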

Program 17.6 POKEV

As a prelude to finding an optimal strategy of a game of poker
we will write a function POKEV(HAND) which will evaluate a
poker hand (5 cards) producing a number (very nearly) uniformly
distributed in the range (0,1) and monotonically increasing with
the strength of the hand. Thus, hand H1 is stronger than H2 if
POKEV(H1) > POKEV(H2). The constraint that the numbers be
uniformly distributed is very important to the successful
operation of the optimal POKER-playing program. That is, the
percentage of times that a hand H will be such that
POKEV(H) < X must be X or close to it. This is perhaps the
trickiest part of the program.


To begin with we find, via pattern matching, which of the
several categories the hand falls into, e.g., bust, pair,
two-pair, three-of-a-kind (trips), etc. We set an array (POKEV_A)
to contain probabilities that such hands are dealt. The
probabilities can be computed or looked up in a source such as
Epstein [1967]. We then need to resolve the question of where
a given hand falls with respect to all other hands in its
category (the variable FRACTION). This may be done crudely by
regarding the values of the hand, sorted in descending order,
as a number in a base-13 radix system. Unfortunately (as the
author learned by experience) the result is too inaccurate to
lead to optimal play. Consider for example, bust hands. Few
hands would have a lead value of 10 or less and no hands would
have a lead value of 6 or less. Hence no hands would evaluate
to .15 or less, a severe distortion.


A solution is to consider the hand as representing a number in
the combinatorial number system (see DECOMB, Prog. 15.2). This
system has the property that the digits descend, just as
required. Were it not for straights, the representation for bust
hands would be exact.
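
To see why the combinatorial number system gives an even spread, consider the following Python sketch. It is an illustration only: it assumes that DECOMB (Prog. 15.2, not shown here) maps a strictly descending list of digits to its rank in the usual combinatorial number system, and the name comb_rank is hypothetical. Card values are coded 0 (deuce) through 12 (ace).

```python
from itertools import combinations
from math import comb


def comb_rank(values):
    """Rank of a strictly descending sequence of distinct values.

    For digits c1 > c2 > ... > ck the rank is
    C(c1, k) + C(c2, k-1) + ... + C(ck, 1).
    """
    k = len(values)
    return sum(comb(v, k - i) for i, v in enumerate(values))


if __name__ == "__main__":
    # Every 5-value combination drawn from 13 values gets its own rank,
    # and the ranks fill 0 .. C(13,5)-1 with no gaps.
    ranks = sorted(comb_rank(tuple(sorted(c, reverse=True)))
                   for c in combinations(range(13), 5))
    print(ranks[0], ranks[-1], len(ranks) == comb(13, 5))
```

Because the ranks are consecutive integers, dividing a rank by the number of combinations spreads hands of five distinct values evenly over (0,1), which a base-13 reading of the same digits does not.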


For hands such as pairs, trips, two-pairs, fours, and full- 
houses we take the most significant designator (one or two 
cards) as a base-13 number and combine this with the remaining 
cards in a mixed residue fashion to obtain a final evaluation. 


         DEFINE('POKEV(H)VALS,SUITS,V,W')
*
*  Define patterns to detect major poker categories
*
         STRAIGHT_SEQ  =  REVERSE(ALL_VALS)  SUBSTR(ALL_VALS,13,1)
         PAIR.V  =  LEN(1) $ V  *V
         TRIPS.V  =  PAIR.V  *V
         FOURS.V  =  TRIPS.V  *V
         FLUSH.V  =  FOURS.V  *V
*
*  The following array gives the probability that a hand will
*  fall within or lower than the indicated level. 0 is a
*  bust, 1 is a pair, etc.
*
         POKEV_A  =  ARRAY('-1:8')
         POKEV_A<0>  =  0.501
         POKEV_A<1>  =  0.924
         POKEV_A<2>  =  0.971
         POKEV_A<3>  =  0.9924
         POKEV_A<4>  =  0.9963
         POKEV_A<5>  =  0.9983
         POKEV_A<6>  =  0.99974
         POKEV_A<7>  =  0.999985
         POKEV_A<8>  =  1.0
*
*  PR(L,PREFIX) is a utility function used by POKEV to com-
*  pute the actual evaluation of the poker hand, assign it to
*  POKEV and return. L is the level of the hand as in the
*  above array. PREFIX is the secondary evaluation parameter
*  and consists of zero, one or two cards (e.g., the 6 of
*  trip 6's). For further resolution, the variable VALS con-
*  tains the rest of the values in order of significance.
*  These are regarded as a combinatorial representation of
*  some number.
*
         DEFINE('PR(L,PREFIX)COMBS,FRACTION,A')       :(POKEV_END)
PR       COMBS  =  COMB(13,SIZE(VALS))
         BASEB_ALPHA  =  ALL_VALS
         COMB_ALPHA  =  ALL_VALS
         FRACTION  =  (BASE10(PREFIX,13) * COMBS + DECOMB(VALS))
+                     / (13. ** SIZE(PREFIX) * COMBS)
         A  =  POKEV_A
         POKEV  =  A<L - 1>  +  (A<L> - A<L - 1>) * FRACTION
         PR  =  .RETURN                               :(NRETURN)


*
*  Entry point for POKEV. Thanks to PR, our job reduces to a
*  simple matter of pattern matching.
*
POKEV    VALS  =  REVERSE(ORDER(VALS(H)))
         SUITS  =  SUITS(H)
         STRAIGHT_SEQ  VALS | ROTATER(VALS,-1)        :F(POKEV_3)
         SUITS  FLUSH.V                               :S(PR(8))F(PR(4))
POKEV_3  SUITS  FLUSH.V                               :S(PR(5))
         VALS  PAIR.V                                 :F(PR(0))
         VALS  FOURS.V  =                             :S(PR(7,V))
         VALS  TRIPS.V  =                             :F(POKEV_5)
         W  =  V
         VALS  PAIR.V  =                              :S(PR(6,W V))F(PR(3,W))
POKEV_5  VALS  PAIR.V  =
         W  =  V
         VALS  PAIR.V  =                              :S(PR(2,W V))F(PR(1,W))
POKEV_END

Names referenced     Name       Type       Where defined
by POKEV:            ORDER      Function   Program 3.1
                     ROTATER    Function   Program 3.5
                     REVERSE    Function   Program 3.6
                     COMB       Function   Program 15.1
                     BASE10     Function   Program 2.5
                     CARDPAK    Package    Program 17.5
                     DECOMB     Function   Program 15.2
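
The arithmetic performed by PR can be followed in a short Python sketch. This is an illustration under assumptions, not a transcription: comb_rank stands in for DECOMB (assumed to be the usual combinatorial rank), the function name evaluate is hypothetical, card values are coded 0 (deuce) through 12 (ace), and the unset array element POKEV_A<-1> is taken to be zero.

```python
from math import comb

# Cumulative probabilities from the POKEV_A array, levels 0 (bust) to 8.
LEVEL_CEILING = [0.501, 0.924, 0.971, 0.9924, 0.9963,
                 0.9983, 0.99974, 0.999985, 1.0]


def comb_rank(values):
    """Combinatorial rank of strictly descending values (DECOMB stand-in)."""
    k = len(values)
    return sum(comb(v, k - i) for i, v in enumerate(values))


def evaluate(level, prefix, rest):
    """Mixed evaluation: prefix read in base 13, rest read combinatorially."""
    combs = comb(13, len(rest))
    prefix_number = 0
    for v in prefix:                       # BASE10(PREFIX,13)
        prefix_number = prefix_number * 13 + v
    fraction = ((prefix_number * combs + comb_rank(rest))
                / (13 ** len(prefix) * combs))
    floor = LEVEL_CEILING[level - 1] if level > 0 else 0.0
    return floor + (LEVEL_CEILING[level] - floor) * fraction


if __name__ == "__main__":
    print(evaluate(1, [0], [12, 11, 10]))    # pair of deuces, A K Q
    print(evaluate(1, [12], [11, 10, 9]))    # pair of aces, K Q J
    print(evaluate(6, [12, 11], []))         # aces full of kings
```

Each category thus occupies its own slice of (0,1), whose width is the probability of the category, and FRACTION places the hand within that slice.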
Program 17.7 POKER

As the reader may be aware, there are many forms of the game of
poker: Draw, Stud (5 and 7 cards), Baseball, Blind, etc. There
may be wild cards and there may be any number of players. We
will pick the simplest game, viz. cold-hand five-card poker
between two players with nothing wild. This choice is dictated
by the simple fact that it is the only poker game that has been
fully analyzed [Cutler 1975] and for which an optimal strategy
exists. The reader may obtain additional references to the
analysis of this game from Cutler's paper or from a cited
bibliography, Findler [1972].


In cold-hand poker, each player enters an ante into the pot 
and is dealt a hand (best thought of as a number in the range 
(0,1)) and the players take turns betting, checking, calling, 
raising and folding. Briefly, checking and betting are done 
when the pot contains equal contributions from both players 
(such as at the start or after a check). Calling, raising and 
folding are done when it is up to one of the players to 
equalize the pot. If he does not, he folds, forfeiting his 
right to the pot. If he calls, there is a showdown. A raise
is a call followed by a bet. The set of possibilities is
shown in Figure 17.5 where the first player is designated X
and the second is Y. Note that Check-raises are not permitted.


                   Call                 Call
                    ^                    ^
                    |                    |
       +--Bet---->[ Y ]---Raise------>[ X ]---Raise---> ...
       |            |                    |
       |            v                    v
     [ X ]         Fold                 Fold
       |
       |
       +--Check-->[ Y ]---Bet-------->[ X ]---> Call
                    |                    |
                    v                    v
                  Check                Check


                        Figure 17.5

      The allowable bet sequences of cold-hand poker.


In the game given by Cutler, the value for all bets is the 
current value of the pot. The value of a raise is found by 
decomposing the raise into a call followed by a bet. We will 
extend the game somewhat by allowing the player to set the 
value of the bet  (before-hand) to any fraction of the pot. 
Whereas all poker games require some limit, most games do per- 
mit players to bet any amount up to this limit. It has been 
conjectured that any bet short of the limit is suboptimal so 
that it might be reasonable to allow the player to make sub- 
maximal bets. But then the strategy, particularly when to 
fold, would have to be changed. 


The derivation of the optimal strategy is beyond the scope of
the current discussion. To obtain a flavor for the analysis,
consider only the case where the first player, X, may check or
bet and the second player, Y, either calls or folds. Since
Y's move ends the game, he has nothing to conceal from X and
so he plays a pure strategy of calling on all good hands
(anything above a certain value called the call line) and
folding on poor hands (anything else). Now consider X's
situation. On very strong hands, X has nothing to lose by betting.
On his average hands he has very much to lose if he bets since
he would have to square off against Y's better hands. On the
other hand, if he has an absolutely rotten hand, his only hope
of winning is to bluff Y. Though he stands to lose more if
caught bluffing, his expectation, it can be shown, is larger
than if he stood the certain loss of a showdown with Y. The
pattern of this simple situation holds in all the more complex
cases, viz. a bet on all hands above a certain level and a
bluff on all hands below a certain level. Also the bluff must
be in a fixed ratio R of the percentage of legitimate bets
where R depends on the bet limit.


We list here for convenience, various parameters used by the 
poker program. 


L = bet limit as a percentage of the pot. 

R = the bluff ratio (L / (1 + L))

A = the initial betting line for player X. X bets on hands 
greater than this. He checks on hands worse, except that 
on his lowest (1 - A) * R hands he bluffs. 

B = the call line for player X after the sequence Check-Bet. 
Below this line he folds. He has no other options. See 
Figure 17.5. 

C = the betting line for player Y after X checks. Below this 
line, player Y calls except for the lower R * (1 - C) 
hands which he bluffs. 

D = the call line for player Y after X bets. Above this
line, he will call (except for the very good hands which
he bets) and below this level he will fold (except for
the bluffs).
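
The way the lines A and C and the ratio R partition the range of hand values can be sketched as follows. This Python fragment is an illustration only; the function names are hypothetical, the numeric lines used in the usage lines are arbitrary sample values rather than optimal ones, and Y's choice after a check follows the rule stated for C (bet above the line, bluff on the lowest R * (1 - C) hands, otherwise check).

```python
def bluff_ratio(limit):
    """R = L / (1 + L), with the bet limit L given as a fraction of the pot."""
    return limit / (1 + limit)


def x_opening_move(x, a_line, r):
    """X's first move for a hand of value x in (0,1)."""
    if x > a_line:
        return "bet"                 # legitimate bet above the betting line
    if x < (1 - a_line) * r:
        return "bluff"               # lowest (1 - A) * R hands
    return "check"


def y_after_check(y, c_line, r):
    """Y's move after X checks, for a hand of value y in (0,1)."""
    if (1 - c_line) * r <= y < c_line:
        return "check"
    return "bet"                     # either a strong hand or a bluff


if __name__ == "__main__":
    r = bluff_ratio(1.0)             # pot-sized bets give R = 1/2
    for hand in (0.05, 0.50, 0.90):
        print(hand, x_opening_move(hand, 0.7, r), y_after_check(hand, 0.7, r))
```

Note that the number of bluffing hands is tied to the number of legitimate betting hands through R, so raising the betting line shrinks both together.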


The astute reader will note that the game can go on in- 
definitely whereas we have provided parameters for only a 
finite number of situations. The parameters ALPHA and BETA 
below serve to bridge the gap between the finite and the in- 
finite as they provide rules for extrapolating out to the Nth 
raise. 


ALPHA = the raise attenuation factor. Given that the
opponent's best strategy is to raise with his best P
hands, then our best strategy is to respond by raising
on our best P * ALPHA hands. Note that the raise
attenuation factor for a round trip is ALPHA ** 2 and
this factor is actually used in the program.


BETA = the lion factor. Given that my optimal strategy is to 
bet (or raise) in the upper P hands, then, if my oppo- 
nent responds by raising, I will fold below the BETA * 
P line (unless I'm bluffing). (1 - BETA is sometimes 
called the chicken factor.) 


*
*  The function ABCDR(L) will set the global variables A, B,
*  C, D and R as well as the parameters ALPHA and BETA. It
*  is assisted in this by the functions ALPHA(L) and BETA(L)
*  which compute ALPHA and BETA respectively.
*
         DEFINE('ABCDR(L)THETA,PHI,TAU,TTR')
         DEFINE('ALPHA(L)T')
         DEFINE('BETA(L)T')
                                                      :(ABCDR_END)
*
*  Entry point for ALPHA:
*
ALPHA    T  =  1 + 2 * L
         ALPHA  =  -(T + 1) + SQRT(T ** 2 + 6 * T + 1)
         ALPHA  =  ALPHA / (2 * T)                    :(RETURN)
*
*  Entry point for BETA:
*
BETA     T  =  1 + 2 * L
         BETA  =  -(T ** 2) + 2 * T + 1 + (T - 1) *
+                 SQRT(T ** 2 + 6 * T + 1)
         BETA  =  BETA / (2 * T ** 2)                 :(RETURN)
*
*  Entry point for ABCDR:
*
ABCDR    ALPHA  =  ALPHA(L)
         BETA  =  BETA(L)
         PHI  =  L / (1 + 2 * L)
         THETA  =  1 - PHI
         TAU  =  1 + 2 * L
         R  =  L / (1 + L)
         TTR  =  TAU * THETA / R
         A  =  -1 + 2 * PHI + ALPHA + TTR * (4 * PHI + 2 * ALPHA)
         A  =  A / (TAU * THETA + ALPHA + TTR * (2 * ALPHA + 1))
         B  =  4 * PHI + 2 * ALPHA - (2 * ALPHA + 1) * A
         C  =  2 * PHI + ALPHA - A * ALPHA
         D  =  R * (1 + ALPHA) - R * ALPHA
                                                      :(RETURN)
ABCDR_END
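
To get a feel for the magnitudes involved, the formulas for ALPHA, R, A, B and C can be evaluated numerically. The Python sketch below is an illustration only and rests on assumptions: it takes T and TAU to be 1 + 2L and PHI to be L / (1 + 2L), it omits BETA and D, and the function names alpha and betting_lines are hypothetical.

```python
from math import sqrt


def alpha(limit):
    """Raise attenuation factor for a bet limit given as a pot fraction."""
    t = 1 + 2 * limit
    return (-(t + 1) + sqrt(t * t + 6 * t + 1)) / (2 * t)


def betting_lines(limit):
    """Return R and the lines A, B, C as computed in the ABCDR listing."""
    al = alpha(limit)
    phi = limit / (1 + 2 * limit)
    theta = 1 - phi
    tau = 1 + 2 * limit
    r = limit / (1 + limit)
    ttr = tau * theta / r
    a = ((-1 + 2 * phi + al + ttr * (4 * phi + 2 * al))
         / (tau * theta + al + ttr * (2 * al + 1)))
    b = 4 * phi + 2 * al - (2 * al + 1) * a
    c = 2 * phi + al - a * al
    return {"R": r, "A": a, "B": b, "C": c}


if __name__ == "__main__":
    print("ALPHA =", round(alpha(1.0), 4))
    for name, line in betting_lines(1.0).items():
        print(name, "=", round(line, 4))
```

Under these assumptions a pot-sized limit (L = 1) gives ALPHA of about 0.215 and a betting line A of about 0.874, so X opens with a legitimate bet on roughly his best eighth of hands.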


*
*  BET() will compute the amount which can be bet with a
*  given limit L.
*
         DEFINE('BET()')                              :(BET_END)
BET      BET  =  CONVERT(POT * L,'INTEGER')
         BET  =  GT(BET,HIM)  HIM
         GT(BET,0)                                    :S(RETURN)F(FRETURN)
BET_END
*
*  Now for the POKER program proper. Given the mnemonic
*  labels, the use of QUEST, and the discussion in the text,
*  comments are virtually unnecessary. The request for the
*  lucky number is simply a device to warm up our random num-
*  ber generator so that identical hands will not always be
*  dealt.
*
         OUTPUT  =  'Welcome to Cold-hand Poker'
         QUEST('Would you like to know the rules? /'
+           '(YES) | (NO)INIT')                       :S($LABEL)
PLOOP    OUTPUT  =  INPUT                             :S(PLOOP)
INIT
         QUEST('What is your lucky number today?/RAN_VAR(1...1000)')
         HIM  =  RANDOM(100) + 20
         OUTPUT  =  "We'll start you off with " HIM " chips"
NEWP     QUEST('Bet limit (% of pot) = /L(10...1000)')
         L  =  L / 100.
         ABCDR(L)
ANTE     QUEST("What's the ante? /ANTE(1...HIM)")
START    GT(ANTE,HIM)                                 :S(ANTE)
         POT  =  2 * ANTE
         HIM  =  HIM - ANTE
         OUTPUT  =  'With a ' ANTE ' chip ante the pot has '
+           POT ' chips'
         HX  =  RHAND(5,1)
         X  =  POKEV(HX)
         HY  =  RHAND(5)
         Y  =  POKEV(HY)
         OUTPUT  =  'You are dealt ' DISPLAY(HX)
         RAISE  =  (1 - A) * ALPHA
         CALL  =  1 - D
         QUEST('Would you like to bet(B) or check(-)? /'
+           '(B)HE_BETS | (-)HE_CHECKS')              :S($LABEL)
HE_CHECKS OUTPUT  =  LETMESEE()
         (LE((1 - C) * R,Y) LT(Y,C))                  :S(I_CHECK)
I_BET    BET  =  BET()                                :F(CANT_BET)
         POT  =  POT + BET
         OUTPUT  =  "I guess I'll bet " BET " chips."
         QUEST('How about you, call(C) or fold(F)? /'
+           '(C) | (F)I_WIN')                         :S($LABEL)
HE_CALLS POT  =  POT + BET
         HIM  =  HIM - BET                            :(COMPARE)
I_CHECK  OUTPUT  =  "I'll check too"                  :(COMPARE)
HE_BETS  BET  =  BET()                                :F(CANT_BET)
         POT  =  POT + BET
         HIM  =  HIM - BET
         OUTPUT  =  'You bet ' BET ' chips.'
         OUTPUT  =  LETMESEE()
         GT(Y,1 - RAISE)                              :S(I_RAISE)
         GT(Y,1 - CALL)                               :S(I_CALL)
         LT(Y,R * RAISE)                              :S(I_RAISE)F(I_FOLD)
I_RAISE  OUTPUT  =  "I'll see your " BET " chips"
         POT  =  POT + BET
         BET  =  RBET()                               :F(CANT_BET)
         OUTPUT  =  " and raise you " BET
         POT  =  POT + BET
         QUEST('You must now raise(R), call(C) or fold(F) /'
+           '(R) | (C)HE_CALLS | (F)I_WIN')           :S($LABEL)
HE_RAISES OUTPUT  =  'You call my ' BET ' chips and'
         HIM  =  HIM - BET
         POT  =  POT + BET
         CALL  =  RAISE * BETA
         RAISE  =  RAISE * ALPHA * ALPHA              :(HE_BETS)
I_CALL   OUTPUT  =  'OK, I call'
         POT  =  POT + BET                            :(COMPARE)
CANT_BET OUTPUT  =  'Since you have no money left we '
+           'have to stop here'
COMPARE  OUTPUT  =  "Let's see, I have " DISPLAY(HY)
         GT(X,Y)                                      :S(HE_WINS)
I_WIN    OUTPUT  =  'I guess I take all ' POT ' chips in the pot'
         OUTPUT  =  INSULT()                          :(SUMMARY)
I_FOLD   OUTPUT  =  'I fold'
HE_WINS  HIM  =  HIM + POT
         OUTPUT  =  'You win the ' POT ' chips in the pot'
         OUTPUT  =  PRAISE()                          :(SUMMARY)
SUMMARY  OUTPUT  =  'You now have ' HIM ' chips'
         OUTPUT  =  EQ(HIM,0)  'So Long'              :S(END)
         QUEST('Same game (S) or new parameters (N)? /'
+           '(S)START | (N)NEWP')                     :($LABEL)
END

Names referenced     Name       Type       Where defined
by POKER:            QUEST      Function   Program 17.2
                     SQRT       Function   Program 15.6
                     POKEV      Function   Program 17.6
                     CARDPAK    Package    Program 17.5
Epilogue

The following session was actually obtained using the above
poker program. As usual, underscored items indicate responses
by the machine.

What is your lucky number today? 177
We'll start you off with 120 chips
Bet limit (% of pot) = 100
What's the ante? 10
...
Would you like to bet(B) or check(-)? -
I need time to meditate about this problem
I'll check too
...
Thank you very much for the game, I enjoyed your brilliant
effort
Same game (S) or new parameters (N)? S
...
OK, I call
Let's see, I have JS 8D KC 5H 5D
...
Your heavy-handed performance befits a silly ass

Not all games are this brief. With lower betting limits, op- 
timal play calls for generally more betting. The most complex 
bidding sequence resulted with a bet limit of 10% of the pot. 
The player was dealt two-pair and bet ruthlessly. The machine 
also bet heavily raising three times before calling. The 
machine had a full house. In general, however, the machine is 
very conservative and most bidding sequences are quite short. 


The use of the 'lucky number' ruse to initialize the random
number generator is common but entirely unnecessary if one has 
the time-of-day available to him. The time of day is actually 
available in many SNOBOL's, though not in the original. 


Though the reader may be expected to understand most of the 
routines in this book, the equations used in the function 
ABCDR to compute these parameters are probably not in this 
category. At this writing, this is their only appearance in 
print. 


Exercise 17.1
Assume a machine and a player would like to play cards. If the
player shuffles and deals, the machine may be cheated. If the
machine randomly generates hands, the player could be cheated.
How can a one-way cipher be used to ensure a fair deal?


Exercise 17.2
Assume one had a program to play penny-matching such that the
program attempted to find patterns in the play of the opponent.
Assume that there were no randomizing component in the program
but that it was strictly deterministic. Is there a strategy
which will beat such a program?


Exercise 17.3
Categorize and describe the decision graph for the following
game. Player A places $10 in the pot and player B places $1 in
the pot. First it is player A's turn and he can bet $1 whereupon
B must call or fold. If B folds, A takes the pot. If he calls,
he matches A's $1 and it remains A's turn. The procedure
continues until A chooses not to bet whereupon they roll a die.
1 or 2 is victory for B; 3, 4, 5 or 6 is victory for A.


Exercise 17.4
Write a function PHRASE(LIST) where LIST is a list of names
separated by commas which will, for each name NM in the list,
(1) define a function by that name and (2) compile code so that
the function returns RSENTENCE('<NM>'). In this way, for
example,

         PHRASE('INSULT,PRAISE,LETMESEE')


could take the place of the function definitions given in 
Prog. 17.1. 


Exercise 17.5
Some variables cannot be used in a QUEST descriptor
(Prog. 17.2). Give a simple rule to prospective QUEST users so
that they may avoid any difficulties. How would you modify
QUEST so that a diagnostic can be given?


Exercise 17.6
One of the reasons that QUEST was written with a separate
utility function QUESTP was so that it could be easily modified
to handle extensions of the following kind. Extend QUEST so
that several arguments may be supplied separated by commas.
QUEST patterns are then any combination of QUEST descriptors
joined by the operators comma (,) and alternation (|) with
comma having higher precedence. Also allow parentheses in such
expressions.


Exercise 17.7
Extend QUEST so that it accepts, in addition to number ranges,
letter ranges of the form (C1-C2) where C1 and C2 are single
characters.


Exercise 17.8
The game of NIM is such that there are four piles of 1, 3, 5,
and 7 stones. Each player may take any number, including all,
of any one pile. He must take at least one stone, however. The
person forced to remove the last stone loses. There is an
optimal strategy for NIM which is based on converting the
numbers to binary and exclusive-ORing on a digit-by-digit
basis; it guarantees a win for the player who can leave a zero
result after his move (from the starting piles 1, 3, 5 and 7,
whose exclusive-OR is already zero, that is the second player).
There are also optimal strategies if the game is extended to
selecting from any K piles; one then uses a K+1 system; see
Ball [1962].


But the game can easily be perturbed so that the optimal 
strategies can't be used. Examples include placing a limit on 
the number of stones or requiring that an even number be fol- 
lowed by an odd. Of course, such rule changes do not 
invalidate a decision graph approach. For these reasons, if 
not for the sheer joy of doing so, write a function NDA(S)
which will prepare and return a NIM decision array. S will be 
a string of initial-pile numbers such as '1,3,5,7'. Assume 
the one-pile no-limit restriction on betting. 
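
As background for this exercise, the digit-by-digit exclusive-OR mentioned above can be sketched in a few lines of Python. This is an illustration only, with hypothetical function names; it computes the exclusive-OR of the piles and the moves that leave a zero result, and it does not handle the special endgame adjustment needed when the player taking the last stone loses.

```python
from functools import reduce
from operator import xor


def nim_sum(piles):
    """Exclusive-OR of the pile sizes, taken bit by bit."""
    return reduce(xor, piles, 0)


def balancing_moves(piles):
    """Moves (pile index, new size) that leave an exclusive-OR of zero."""
    total = nim_sum(piles)
    return [(i, p ^ total) for i, p in enumerate(piles) if (p ^ total) < p]


if __name__ == "__main__":
    print(nim_sum([1, 3, 5, 7]), balancing_moves([1, 3, 5, 7]))
    print(nim_sum([1, 3, 5, 6]), balancing_moves([1, 3, 5, 6]))
```

For the piles 1, 3, 5 and 7 the exclusive-OR is already zero, so no balancing move is available to the player who must move first.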


Exercise 17.9
Modify the function SDA (of STONE (Prog. 17.3)) so that the
variable MAX designates a list of possible moves separated by
commas. For example, MAX = '1,3,5' means that 1, 3 or 5 stones
may be selected.

Exercise 17.10
Amaze your friends with this one. Modify STONE so that the
player can insert, in place of the parity, a predicate P(N)
which will determine whether or not the player (opposing the
machine) wins. Thus:

         EQ(REMDR(N,2))


as the predicate P(N) indicates that the player will win if he 
has an even number of stones. Also 


(GE(N, 5) LE(N, 10)) 


indicates that the player will win if his total is within the 
range (5,10). 


Exercise 17.11
How many symmetries are there to the 4X4X4 Tick-tack-toe game
(i.e. classic 3-D Tick-tack-toe)? How about a 3X3X3 board?


Exercise 17.12
Modify TTTM and TTTV and rewrite NEXTBD for the following game.
The board is 3X3X3, moves are like Tick-tack-toe and a winning
pattern is:

   [Figure: the winning pattern of X's]


on any of the 6 sides or in any of the 3 slices parallel to a 
side through the middle or in any of the 6 slices through the 
diagonal. 


Exercise 17.13
Consider a three-dimensional cube, 3X3X3 with one corner
subcube removed leaving exactly 26 subcubes. How many
symmetries of this cube are there?

Exercise 17.14
With the help of QUEST and a nice board-printout function,
complete the Tick-tack-toe game (Prog. 17.4).


Exercise 17.15
One way of speeding up TICTACTOE is to not look further when a
move is found which results in victory. Implement this (Hint:
it requires adding one instruction to TTTM.)


Exercise 17.16
To play 3D Tick-tack-toe on a 4X4X4 board, one needs to limit
somehow the depth of search. If the depth of search is limited,
one needs a heuristic for evaluating a board. Use the following
scheme. Assume that it is X's move. For every X find the lines
passing through it not already blocked by an O. If it stands by
itself in a line add 1. If it stands with another add 3. If it
stands with two others, add 10000 or some other such large
number as this would imply victory. Do a similar evaluation
for O and subtract the two amounts. Modify TTTV to use this
evaluation whenever the global variable FNCLEVEL reaches the
value of the keyword &FNCLEVEL. The global variable is of
course set by the main program.


Exercise 17.17
Let H be a hand of cards as in CARDPAK. Suppose we wish to sort
the cards in the order of increasing value (ignoring suits).
How could the function ORDER be modified to accomplish this?


Exercise 17.18
Modify the CARDPAK functions so that they are operative with a
pinochle deck (48 cards, Ace-9 (twice) of each suit).


Exercise 17.19
A bridge hand is evaluated for high-card points by assigning
4, 3, 2, 1 points respectively to the A, K, Q, J. In two
statements, randomly shuffle and deal a hand, and determine and
print its value. You may use COUNT (Prog. 3.4).


Exercise 17.20
Modify POKEV (Prog. 17.6) so that it evaluates a three-card
poker hand. Note that straights and flushes do not count extra
but that a
straight-flush counts higher than either a pair or trips. Use 
the values 0.83, 0.955, and 0.978 as the probabilities of get- 
ting a bust, a pair or lower, and three-of-a-kind or lower 
respectively. 
Exercise 17.21
If we were playing with three decks, so that duplicate and
triplicate cards could actually be obtained in a single hand,
POKEV would no longer
be monotonic. Why? How would you modify POKEV so that it 
would work with any number of decks? 


Exercise 17.22
Write a function POKUNVAL which will be an approximate inverse
of POKEV. That is, given a real number X in the range (0,1),
POKEV(POKUNVAL(X)) should approximate X.


Exercise 17.23
POKEV is not especially uniform over the range of hands
categorized as two-pairs. Fix up POKEV so that it regards (W V)
as a number in a combinatorial number system rather than in a
radix system.


Exercise 17.24
Assuming that both players are playing optimally, label the
branches of the flowchart for cold-hand poker (Figure 17.5)
with comparisons of the values of their hands against
expressions involving the parameters A, B, C, etc. Modify POKER
so that it plays an optimal game for X, rather than Y.


Exercise 17.25
If we were not concerned with losing optimal behavior, we
could, by adding just one statement to POKER (Prog. 17.7),
permit the player to bet any amount up to the maximum allowed.
Give an example of such a statement and indicate where it
should be placed.
The development of the stored-program machine is thought to be
of importance because it allows a program to modify itself.
Today, index registers obviate the necessity for a program to
be self-modifying so that the practice is not only considered
non-important (witness the growth of pure procedure) but is
considered harmful as an obscuring practice. The real and
lasting significance of stored program is that it allows
programs to produce other programs (if most machines still had
plug-board control, the output of a 'compiler' would have to
be a wired-up plug-board or a wiring diagram and a congenial
and dextrous computation staff).



It is therefore no coincidence that assemblers began appearing
at about the time of the first installations of stored-program
machines (circa 1950) and compilers (originally called
automatic coders) and interpreters began to be developed
shortly thereafter. This marked for the first time in the
history of mankind the development of artificial languages;
languages which would be literally and unfailingly obeyed by a
mechanical servant; languages whose constructs and convolutions
are subject only to the requirement that a translation
algorithm be written for the language. Alas, this turns out
to be one of the major obstacles to creating languages which
are powerful and congenial, since it is no simple task to
describe how to convert an arbitrary language into efficient
code. This not only makes it difficult to implement large
languages efficiently, but also makes it difficult to formally
describe a large language.


This chapter is devoted primarily to the task of describing
how language translators of one kind or another can be written
using the SNOBOL4 language. Compiling and assembling are
primarily string processing activities and so it is not
surprising that SNOBOL4 should be particularly helpful along
these lines. But actually it is by no means obvious how to employ
the powerful pattern matching operations to parse languages.
In fact, Griswold [1974, p. 11] says that "patterns derived
from grammars are of little use in such [i.e., parsing]
problems." We will show, on the contrary, that we can almost
directly map a formal grammar into a parsing pattern and that
SNOBOL4 patterns are particularly applicable to the parsing
task.


Traditionally, SNOBOL processors have had a tendency to be big
and slow and for this reason applications have tended to hover
about the periphery of linguistic translation in such chores
as bootstrapping, pre-processing, macro pre-passes and in
general software which has a small user population and high
development costs. But the more recent implementations of
SNOBOL4 (viz. SPITBOL, SITBOL and FASBOL) have greatly extended
the practical application of SNOBOL4 while the great
proliferation of languages and machines has extended the need
for such applications. Also, SNOBOL4 has often been used to
teach compiler-writing because it simplifies the task
sufficiently to allow the student to complete a compiler in a
term. By using SNOBOL4 many of the by-now routine tasks of
lexical and syntactic analysis are quite easily accomplished,
permitting attention to be focused on more difficult aspects
of the translation task.


   Machine M is a word-addressable machine with 32 bits per
   word. All instructions have the format:

          +---------+------+-------+-------+
          | OP-code |  AC  |   X   |   A   |
          +---------+------+-------+-------+
     Bits     0-7     8-11   12-15   16-31

   There are sixteen general purpose registers which can
   serve both as accumulators for arithmetic and as index
   registers for address modification. The AC (accumulator)
   and X (index register) fields are four bits for the purpose
   of specifying one of these sixteen registers. The
   maximum number of words for the machine is 2^16 so that the
   A (address) field can specify absolutely any address in
   the machine. The effective address, E, for any instruction
   is the sum of the index register (X) plus the value of the
   A field. We will refer to the contents of location E as
   C(E). If E is less than 16, a register is the assumed
   location. If the X field is 0, no indexing is assumed.
   Thus, Reg. 0 cannot be used as an index register. In the
   description of OP-codes which follows, AC will refer to the
   accumulator referenced by the AC field.

   Mnemonic   Code    Instruction
              (Hex)

   LOAD       21      Load C(E) into AC
   STORE      22      Store AC into location E
   ADD        31      Integer add C(E) to AC
   SUB        32      Integer subtract C(E) from AC
   MUL        33      Multiply AC by C(E) (overflow lost)
   DIV        34      Integer divide C(E) into AC
   FADD       71      Floating add C(E) to AC
   FSUB       72      Floating subtract C(E) from AC
   FMUL       73      Floating multiply AC by C(E)
   FDIV       74      Floating divide C(E) into AC
   LOADA      2A      Load effective address E into AC
   LOADN      2F      Load -C(E) into AC
   BR         A0      Branch to location E
   BRGT       A1      Branch to E if AC is > 0
   BRLT       A2      Branch to E if AC is < 0
   BREQ       A3      Branch to E if AC is = 0
   BRNE       A4      Branch to E if AC is ≠ 0
   BRGE       A5      Branch to E if AC is ≥ 0
   BRLE       A6      Branch to E if AC is ≤ 0

                        Figure 18.1

                A description of machine M.


Since we will be involved in this chapter with assembling and 
compiling it will be helpful to fix on a particular machine. 
The machine whose instruction set is described in Figure 18.1 
will be referred to as machine M. It will be used as an exam- 
ple machine throughout. 
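
The instruction layout of Figure 18.1 can be made concrete with a
short sketch. It is written in Python rather than SNOBOL4 so that
it runs stand-alone; the names OPCODES, encode and
effective_address are illustrative and do not appear in the book.
The sketch assumes that bit 0 is the most significant bit of the
word and that the codes for BR and BRNE are A0 and A4.

```python
# Op-codes of machine M, taken from Figure 18.1 (hexadecimal).
OPCODES = {
    "LOAD": 0x21, "STORE": 0x22, "ADD": 0x31, "SUB": 0x32,
    "MUL": 0x33, "DIV": 0x34, "FADD": 0x71, "FSUB": 0x72,
    "FMUL": 0x73, "FDIV": 0x74, "LOADA": 0x2A, "LOADN": 0x2F,
    "BR": 0xA0, "BRGT": 0xA1, "BRLT": 0xA2, "BREQ": 0xA3,
    "BRNE": 0xA4, "BRGE": 0xA5, "BRLE": 0xA6,
}


def encode(op, ac=0, x=0, a=0):
    """Pack one instruction into a 32-bit word; return 8 hex digits."""
    if not (0 <= ac < 16 and 0 <= x < 16 and 0 <= a < (1 << 16)):
        raise ValueError("field out of range")
    # OP-code in bits 0-7, AC in 8-11, X in 12-15, A in 16-31.
    word = (OPCODES[op] << 24) | (ac << 20) | (x << 16) | a
    return format(word, "08X")


def effective_address(a, x, registers):
    """E = A + C(X); an X field of 0 means that no indexing is done."""
    return a if x == 0 else a + registers[x]
```

For instance, encode("LOAD", 1, 0, 16) packs op-code 21, accumulator
1, no index register and address 0010 into a single word.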


Program 18.1 (ASM)

ASM is an assembler for machine M. Each word of the machine
can be represented by 32 bits or 8 hexadecimal digits or, if
&ALPHABET has size 256, 4 characters. We will presume that
our assembler is only required to punch hexadecimal digits on
cards, one word per card. Other output formats are rather
easily obtained using conversions from Chapter 2. Our assembly
language will consist of instructions in the following format:


Label Op AC,A(X) Comment 


The four fields indicated are separated by blanks. Absence of
a label is denoted by a blank in column 1. If AC (and/or the
comma) is missing, 0 is assumed. If the '(X)' is missing, 0
is assumed. The comment may be missing; if the Op field is
present, the operand (3rd) field must also be present. If the
Op field is missing, no instruction is generated; thus labels
may appear on separate lines. The Op field may contain any
mnemonic shown in Figure 18.1.


Perhaps the most important single observation one can make 
about an assembler is that it is inherently a two-pass system. 
This is because it is impossible to assert a maximum length 
for the sequence: 


          STORE   ALPHA
            .
            .
            .
ALPHA


Hence addresses such as ALPHA are resolved in the first pass 
based on their location; instructions are translated on the 
second pass. 


The essence of assembling is associative look-up. There are
two distinct reasons for this. It is (by definition) easier
to remember a mnemonic such as 'LOAD' than an op-code such as
'21'. But aside from this it is necessary to have symbols
(such as ALPHA in the above sequence) whose meaning is
resistant to perturbations of the program (such as insertions
or deletions of instructions). The associative lookup is
normally accomplished in most assemblers with the help of some
form of symbol table as described in Chapter 11. In SNOBOL4,
we will use the TABLE datatype to serve this purpose.
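
The two-pass scheme and the associative look-up can be sketched
as follows. This is a minimal Python illustration, not the ASM
program itself: dictionaries stand in for the TABLE datatype, OPS
holds only a subset of Figure 18.1, and the names LINE, OPERAND
and assemble are illustrative. Unlike ASM, the sketch does not
flag errors; an undefined symbol simply raises KeyError.

```python
import re

# A subset of the op-codes of Figure 18.1, as two hex digits each.
OPS = {"LOAD": "21", "STORE": "22", "ADD": "31", "SUB": "32", "BR": "A0"}

# Label, Op and operand fields are separated by blanks; a blank in
# column 1 means that the line carries no label.
LINE = re.compile(r"^(?P<label>\S*)\s*(?P<op>\S*)\s*(?P<operand>\S*)")
# The operand field has the form AC,A(X) with AC and (X) optional.
OPERAND = re.compile(
    r"^(?:(?P<ac>[^,(]*),)?(?P<a>[^(]*)(?:\((?P<x>[^)]*)\))?$")


def assemble(source):
    """Return one 8-digit hex word per instruction in source."""
    lines = [LINE.match(text).groupdict() for text in source]

    # Pass 1: give every label the location of the next instruction.
    symbols, loc = {}, 0
    for fields in lines:
        if fields["label"]:
            symbols[fields["label"]] = loc
        if fields["op"]:
            loc += 1

    def value(sym):
        # A missing field is 0; integers stand for themselves.
        if not sym:
            return 0
        return int(sym) if sym.isdigit() else symbols[sym]

    # Pass 2: translate each instruction using the symbol table.
    words = []
    for fields in lines:
        if not fields["op"]:
            continue
        m = OPERAND.match(fields["operand"]).groupdict()
        words.append("%s%X%X%04X" % (OPS[fields["op"]], value(m["ac"]),
                                     value(m["x"]), value(m["a"])))
    return words
```

The forward reference to ALPHA in the first instruction of a
program can be resolved only because pass 1 has already seen the
whole source before pass 2 begins.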


*  This is a simple assembler for the machine M (Figure 18.1).
*  First we initialize a table (OPS) with the operators and
*  their codes.
*
         LIST = 'LOAD 21,STORE 22,ADD 31,FADD 71,SUB 32,'
+          'FSUB 72,MUL 33,FMUL 73,DIV 34,FDIV 74,LOADA 2A,LOADN 2F,'
+          'BR A0,BRGT A1,BRLT A2,BREQ A3,BRNE A4,BRGE A5,BRLE A6,'
         OPS  =  TABLE()
OPS_INIT LIST  BREAK(' ') . OP  ' '  BREAK(',') . CODE  ','  =
+                                                    :F(INIT1)
         OPS<OP>  =  CODE                            :(OPS_INIT)
*
*  Initialization for Pass 1. SYMS is a table to hold user
*  symbols. LOC is our location counter. We assume I/O unit
*  no. 10 is available for scratch storage.
*
INIT1    SYMS  =  TABLE()
         LABEL.L  =  BREAK(' ') . L  SPAN(' ')
         LOC  =  0
         OUTPUT(.DISK,10)
*
*  Loop for pass 1. Evaluate all symbols.
*
PASS1    X  =  INPUT ' '                             :F(INIT2)
         DISK  =  X
         X  LABEL.L  =
         SYMS<L DIFFER(L)>  =  BASEB(LOC,16)
         LOC  =  DIFFER(X) LOC + 1                   :(PASS1)
*
*  Initialization for pass 2: set up a big pattern
*  (P.OP.AC.A.X) to crack fields.
*
INIT2    REWIND(10)
         DETACH(.DISK)
         INPUT(.DISK,10)
         NO_OP  =  POS(0)  BREAK(' ')  SPAN(' ')  RPOS(0)
         P.OP.AC.A.X  =  NULL $ OP $ AC $ A $ X  NULL . CAUSE
+          POS(0)  BREAK(' ')  SPAN(' ')
+          BREAK(' ') . OP  SPAN(' ')
+          (BREAK(' ,') . AC  ','  |  NULL)
+          BREAK('( ') . A
+          ('('  BREAK(')') . X  ')'  |  NULL)
*
*  We define a generalized convert-symbol routine (CVTSYM)
*  which converts a symbol according to a given symbol table
*  (TABLE) producing a hex string of length LENGTH. TYPE
*  indicates the type of symbol for diagnostic purposes. CAUSE
*  is a global error-bearing variable which is printed on the
*  listing. 'Uf' means undefined symbol in field f. 'Lf'
*  means length of field f is too long.
*
         DEFINE('CVTSYM(SYM,TABLE,LENGTH,TYPE)')     :(CVTSYM_END)
CVTSYM   SYM  =  INTEGER(SYM)  BASEB(SYM,16)         :S(CVTSYM_1)
         SYM  =  TABLE<SYM>
         CAUSE  =  IDENT(SYM,NULL)  'U' TYPE ' '
CVTSYM_1
         SYM  =  LPAD(SYM,LENGTH,'0')
         CVTSYM  =  LE(SIZE(SYM),LENGTH)  SYM        :S(RETURN)
         CAUSE  =  CAUSE 'L' TYPE ' '
         SYM  =                                      :(CVTSYM_1)
CVTSYM_END
*
*  We now go into the pass 2 loop. We tentatively set our
*  error indicator (CAUSE) to syntax error (S).
*
PASS2    CAUSE  =  'S'
         LINE  =  DISK ' '                           :F(END)
         LINE  NO_OP                                 :S(PASS2A)
         LINE  P.OP.AC.A.X
         OP  =  CVTSYM(OP,OPS,2,'O')
         AC  =  CVTSYM(AC,SYMS,1,'R')
         X   =  CVTSYM(X,SYMS,1,'X')
         A   =  CVTSYM(A,SYMS,4,'A')
         PUNCH  =  OP AC X A
         OUTPUT  =  RPAD(CAUSE,15) OP ' ' AC ' ' X ' ' A
+          ' ' LINE                                  :(PASS2)
PASS2A   OUTPUT  =  DUPL(' ',32) LINE                :(PASS2)
END
Names referenced     Name     Type       Where defined
by ASM:              RPAD     Function   Program 3.3
                     BASEB    Function   Program 2.4

Epilogue 


Note that when an error occurs an instruction is generated in 
any case with one or more fields zeroed. This is so that sym- 
bols that are resolved by the assembler will have their cor- 
rect value and that an assembly with one or two small errors 
may nonetheless be a valid assembly for debug purposes. 


The assembler is a very primitive one lacking many 'bells and
whistles' of a commercial product. Extensions such as data
generation statements, expressions, relocatability,
pseudo-ops, conditional assembly and multiple location
counters can be added, however, without a major overhaul of
the program structure. For a more detailed discussion of
assembler implementation, see Donovan [1972].




Compiling using SNOBOL4

There has been much written on the subject of compilation
and parsing in the past several years. Much of this writing
is theoretical and most is devoted to a thorough analysis of
parsing; i.e., the decomposition of an input into its
linguistic components. For example, the recognition that the
source language string:


     A = BETA + C * DELTA

is of the form:

     VARIABLE = EXPRESSION

and that EXPRESSION is of the form TERM1 + TERM2 and that
TERM2 is of the form FACTOR * FACTOR, may be regarded as
parsing the original string. Parsing is an essential component 
in the translation not only of computer languages but of 
natural languages as well. 


It has long been recognized, however, that parsing comprises 
only a portion of the compilation process and not the dominant 
portion by any means. This is especially true in SNOBOL4 where 
pattern matching makes parsing quite automatic, as we will 
see. On the other hand, techniques for generating efficient 
object code from a fully parsed statement are not well under- 
stood and are often embedded in compiler listings and nowhere 
else. Some of these methods have been distilled into English 
and can be found in Gries [1971], Donovan [1972], Graham 
[1975] and McClure [1972]. 


We have introduced in a previous chapter the BNF (Backus
Normal Form) for representing sets of strings or languages. As
an example, the grammar shown in Figure 18.2 can be used to
define a simple language which we will refer to as L1. L1
contains only assignment statements, the four fundamental
(binary) arithmetic operations, and negation. Identifiers


   <IDEN>    ::= <LETTER> | <IDEN><LETTER> | <IDEN><DIGIT>
   <INTEGER> ::= <DIGIT> | <INTEGER><DIGIT>
   <PRIMARY> ::= <IDEN> | <INTEGER> | (<E>)
   <FACTOR>  ::= <PRIMARY> | -<PRIMARY>
   <TERM>    ::= <TERM>*<FACTOR> | <TERM>/<FACTOR> | <FACTOR>
   <E>       ::= <E>+<TERM> | <E>-<TERM> | <TERM>
   <STMT>    ::= <IDEN>=<E>

                        Figure 18.2

          A BNF description for the language L1.

We will assume that the reader is already acquainted with BNF.
He has undoubtedly been exposed to this or similar notation
when learning the constructs accepted by a programming
language or indeed any other linguistic system such as an
operating system command language or an editor's command
language. This notation can be directly mapped into SNOBOL4
patterns so that any syntactic variable is associated with
some pattern. In fact Exercise 18.9 invites you to write a
program to carry out this translation automatically.


One difficulty with a BNF description is that languages that 
it is used to describe are typically not context free. Thus 


A(3) = 17 


may or may not be valid in Fortran depending on declarations 
for A. Pure BNF cannot be used to decide the issue. Such 
context dependencies are generally treated by the addition of 
a symbol table, with appropriate insertions and checks; in 
this way the language can be treated as context free, even 
though it is in fact not. Dynamic function evaluation can be 
used in SNOBOL4 to make these checks. Thus, for example, if
the function ATEST(X) will test to see if its argument is an
array and if ID is a pattern to match identifiers, then


     ID $ X  *ATEST(X)


will match only array identifiers. The function ATEST() can 
be written using symbol tables as were needed in ASM. Routines 
such as ATEST() are often erroneously referred to as semantic 
routines. They are not, for their purpose is to extend a con- 
text free formalism to handle context sensitive situations. 
It would be more correct to use the term syntactic routine for 
any routine used to decide syntax. We will reserve the term 
semantic routine for routines which have a side-effect other 
than recognition such as code production or error-message 
generation. 
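
The idea of a syntactic routine can be sketched outside SNOBOL4
as well. In the Python fragment below, a match on an identifier
succeeds only if a symbol-table check also succeeds. The table
SYMBOLS and its contents are hypothetical, as are the names atest
and match_array_identifier; the regular expression plays the role
of the pattern ID.

```python
import re

# Hypothetical symbol table: identifier -> declared kind.
SYMBOLS = {"A": "array", "N": "scalar"}

# Plays the role of the pattern ID: a letter, then letters or digits.
IDENT = re.compile(r"[A-Z][A-Z0-9]*")


def atest(name):
    """Syntactic routine: succeed only for declared arrays."""
    return SYMBOLS.get(name) == "array"


def match_array_identifier(text, pos=0):
    """Match an identifier at pos, but only if it names an array."""
    m = IDENT.match(text, pos)
    if m and atest(m.group()):
        return m.group()
    return None
```

With this table, the statement 'A(3) = 17' begins with an array
identifier while 'N(3) = 17' does not, although both have the same
context-free shape.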


The semantics of a language described using BNF, i.e., the
meaning of the various linguistic constructs, are seldom
defined formally. For the language L1, for example, we may
say that all arithmetic operations represent operations on
integers of a precision equal to that of the target machine.
Most readers, especially those already exposed to Fortran-like
languages, will then understand the meaning of L1. While this
is true of a simple algebraic language it may not be true if
the language is neither algebraic nor simple. Formal systems
to describe semantics are of two kinds, concrete and
theoretical. A concrete system is one which has been subject
to the rigors of machine implementation; a theoretical system
is one which purportedly could be, but which for some reason
has not. Concrete systems (listings) are messy; theoretical
systems are at least buggy and at worst severely distorted.
The answer to this dilemma may lie in the development of
compiler-compilers which compile inefficiently and produce
inefficient code but which yield sufficiently simple listings
that they may be understood. Much of this chapter is dedicated
to the ultimate fulfillment of this pious hope.


Program 18.2 (L_ONE)

L_ONE is a compiler for the language L1 (Figure 18.2). The
output is in the form of assembly language (accepted by ASM)
for machine M (Figure 18.1). The implementation of L_ONE is
based on a method of employing semantic routines during a
pattern match, a technique suggested to the author by
M. J. Rochkind (Bell Laboratories, Raritan River, N.J.). This
method is based on the observation that invoking a routine to
generate code (as opposed to one used to supplement the match,
as given above in the case of ATEST) is best done using
conditional assignment. This defers any code production until
after the match, thus guarding against premature production.


For example, consider the pattern

     P1 . *A()  P2 . *B()  |  P3 . *C()                 (18.1)

If P1 and P2 match, then A() and B() are called. If P1
matches and P2 fails but P3 matches, then only C() is called.
A() is not called in this case because backup on failure
removes the conditional assignment as was fully described in
Chapter 7. This is, of course, exactly what we want and will
greatly reduce the complexity of a compiler written in SNOBOL4.
The reduction in complexity is worth the fact that we are using
conditional assignment in a way completely unintended by the
originators of the language. Functions called in this way are
supposed to be returning names and receiving values; they do,
but the names are dummy names and the values assigned are
irrelevant.


It will be more convenient to have only one semantic routine,
viz. S_(name), where name is the name of a routine. Thus,
instead of writing

     P1 . *A()

we will write

     P1 . *S_('A')

But this is a bit messy, so we will write a routine S(name) to
return NULL . *S_(name) so that we may write

     P1 S('A')

to achieve the same effect with a cleaner appearance. The
above pattern (18.1) is then written:

     P1 S('A') P2 S('B') | P3 S('C')


Finally, we can scan and push an element all in the same pat- 
tern by the construction: 


PAT . *PUSH() 


where PAT matches the string pushed (See PUSH, Prog. 5.5). The 
semantic routines produce code by popping the stack for the 
location of the previous result, producing code to compute a 
new result, and pushing onto the stack the location of the new 
result. 
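
The combination of deferred semantic actions and a location
stack can be sketched in Python. This is not the SNOBOL4
mechanism: the sketch swaps pattern matching with conditional
assignment for a small recursive-descent parser that records
actions in a list, discards them on backup, and runs them only
after the whole statement has matched. The names compile_stmt and
run are illustrative, and blanks are simply removed first.

```python
import re

OPERAND = re.compile(r"[A-Z][A-Z0-9]*|[0-9]+")


def run(actions):
    """Execute the deferred actions, producing machine M assembly."""
    stack, out, temps = [], [], 0
    for name, arg in actions:
        if name == "PUSH":
            stack.append(arg)
            continue
        if name == "ASGN":
            out.append(" LOAD " + stack.pop())
            out.append(" STORE " + stack.pop())
            continue
        if name == "NEG":
            out.append(" LOADN " + stack.pop())
        else:                                  # ADD, SUB, MUL or DIV
            t = stack.pop()
            out.append(" LOAD " + stack.pop())
            out.append(" %s %s" % (name, t))
        temps += 1                             # new temporary location
        stack.append("TEMP%d" % temps)
        out.append(" STORE " + stack[-1])
    return out


def compile_stmt(text):
    """Compile one L1 statement; return None on a syntax error."""
    s, actions, pos = text.replace(" ", ""), [], 0

    def lit(ch):
        nonlocal pos
        if s.startswith(ch, pos):
            pos += 1
            return True
        return False

    def primary():
        nonlocal pos
        m = OPERAND.match(s, pos)
        if m:
            pos = m.end()
            actions.append(("PUSH", m.group()))
            return True
        mark, n = pos, len(actions)
        if lit("(") and expr() and lit(")"):
            return True
        pos = mark
        del actions[n:]                        # backup discards actions
        return False

    def factor():
        nonlocal pos
        mark, n = pos, len(actions)
        if lit("-") and primary():
            actions.append(("NEG", None))
            return True
        pos = mark
        del actions[n:]
        return primary()

    def binary(sub, ops):
        nonlocal pos
        if not sub():
            return False
        while pos < len(s) and s[pos] in ops:
            mark, n, name = pos, len(actions), ops[s[pos]]
            pos += 1
            if not sub():
                pos = mark
                del actions[n:]
                break
            actions.append((name, None))
        return True

    def term():
        return binary(factor, {"*": "MUL", "/": "DIV"})

    def expr():
        return binary(term, {"+": "ADD", "-": "SUB"})

    m = re.match(r"[A-Z][A-Z0-9]*", s)
    if not m:
        return None
    actions.append(("PUSH", m.group()))
    pos = m.end()
    if not (lit("=") and expr() and pos == len(s)):
        return None                            # nothing was emitted
    actions.append(("ASGN", None))
    return run(actions)
```

Because no action runs until the statement has been accepted, a
syntax error produces no partial code, which is the property the
conditional-assignment technique is meant to secure.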


*  The program L_ONE will compile statements of L1 into
*  assembly language for machine M. In the semantic routines
*  below, there is a label S_op for each operation op.
*
         DEFINE('S(NAME)')
         DEFINE('S_(NAME)T')                         :(S_END)
S        S  =  EVAL("NULL . *S_('" NAME "')")        :(RETURN)
S_       S_  =  .DUMMY                               :($('S_' NAME))
S_NEG    OUTPUT  =  ' LOADN ' POP()
         OUTPUT  =  ' STORE ' PUSH(TEMP())           :(NRETURN)
S_ADD ;S_SUB ;S_MUL ;S_DIV
         T  =  POP()
         OUTPUT  =  ' LOAD ' POP()
         OUTPUT  =  ' ' NAME ' ' T
         OUTPUT  =  ' STORE ' PUSH(TEMP())           :(NRETURN)
S_ASGN   OUTPUT  =  ' LOAD ' POP()
         OUTPUT  =  ' STORE ' POP()                  :(NRETURN)
S_END
*
*  The following patterns will match the syntactic variables
*  of the language L1 and call the appropriate semantic
*  routines.
*
         LET  =  'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
         DIGITS  =  '0123456789'
         IDEN  =  (ANY(LET) (SPAN(LET DIGITS) | '')) . *PUSH()
         INTEGER  =  SPAN(DIGITS) . *PUSH()
         PRIMARY  =  IDEN  |  INTEGER  |  '(' *E ')'
         FACTOR  =  PRIMARY  |  '-' PRIMARY S('NEG')
         TERM  =  *TERM '*' FACTOR S('MUL')  |
+                 *TERM '/' FACTOR S('DIV')  |  FACTOR
         E  =  *E '+' TERM S('ADD')  |
+              *E '-' TERM S('SUB')  |  TERM
         STMT  =  POS(0) IDEN '=' *E S('ASGN') RPOS(0)
*
*  TEMP() is always ready to provide us with a new temporary
*  location.
*
         DEFINE('TEMP()')                            :(TEMP_END)
TEMP     TEMP_NO  =  TEMP_NO + 1
         TEMP  =  'TEMP' TEMP_NO                     :(RETURN)
TEMP_END
*
*  The main program is essentially a single pattern match.
*
READ     S  =  TRIM(INPUT)                           :F(END)
REMOVE_BLANKS
         S  ' '  =                                   :S(REMOVE_BLANKS)
         TEMP_NO  =  0
         S  STMT                                     :S(READ)
         OUTPUT  =  '*** ERROR IN ' S                :(READ)
END
Names referenced     Name     Type       Where defined
by L_ONE:            PUSH     Function   Program 5.5
                     POP      Function   Program 5.6


As a simple example, the input

     A = B-C*D

will produce the output

     LOAD TEMP


The resulting code is clearly non-optimal but it gets the job 
done. There are numerous extensions that one can incorporate 
into L_ONE to produce more efficient code and to provide more 
features. Some of these have been left as exercises. 


The reader should not be misled by the simplicity with which
L_ONE was written into believing that full-fledged compilers
for complete languages can be had cheaply. In general, the
complexity of a compiler will grow nonlinearly with the
introduction of new features. The world is full of
compiler-compilers that look good for toy languages but which
don't quite stand up to the hammering of a full scale language
such as, for example, PL/I. The mere fact that declarations in
PL/I can follow use is enough to discourage the one-pass
approach used in L_ONE. For big compiling, we must step back a
bit and proceed in stages.


Partitioning the compiler

A compiler is generally decomposed into lexical analysis,
syntactic analysis, code optimization and code generation.
The latter two are often intertwined in more than two passes
for good reasons, as we shall see later. The first two of
these phases are indicated in Figure 18.3.


   (a)   ALPHA = BETA + GAMMA ** 2

   (b)   | ALPHA | = | BETA | + | GAMMA | ** | 2 |

   (c)   [tree diagram; bottom-level leaves: GAMMA, 2]

                        Figure 18.3

       A lexical analysis (b) and a syntactic analysis
       (c) of an input string (a).


Lexical analysis decomposes the source string into indivisible 
tokens (or atoms). These tokens are, of course, not literally 
indivisible since they are, after all, composed of characters, 
but they are indivisible in the sense that no further decom- 
position has any meaning with respect to compilation. Thus, 
the meaning of 'ALPHA' is not a composition (homomorphism) of 
the meanings of its individual characters (though its sound 
may be). On the other hand, the meaning of 'ALPHA + BETA' can
be interpreted as a composition of the meanings of the three
tokens 'ALPHA', '+' and 'BETA'. The distinction is very much
like the distinction between morpheme and phoneme in the study 
of natural languages. It is actually a kind of mixed radix 
system whereby a relatively small number of different symbols 
(letters or phonemes) is used to compose a fairly large (but 
finite) number of different notions (words or morphemes). 
Sentences are then built from the words. Evidently there are 
more ideas than sounds. 


When SNOBOL4 is used to compile a programming language, no
distinct lexical pass is required. On the other hand, the
input may have to be massaged (pre-processed). In L_ONE this
amounted to removing blanks. In a real language such as
Fortran, blank removal is not nearly so simple as we will see
(BLANKS, Prog. 18.3). In PL/I the pre-processing may consist
of the extraction of the next statement (see PLI.STMT, Prog.
8.10) and the removal of comments. Redundant blank removal is
not nearly so necessary for PL/I as it is for Fortran (since
identifiers cannot be split in PL/I).


The result of a syntactic analysis is the tree structure shown 
in Figure 18.3. This tree structure may be represented in any 
of a variety of ways, most commonly as a linked structure. In 
SNOBOL4 the tree is perhaps best represented as a string in
Polish prefix form (as described in Chapter 9) because pattern 
matching may then be exploited to effect desired transforma- 
tions. 


It is convenient to separate out that portion of a compiler 
which is machine-dependent simply to avoid duplication of ef- 
fort if the same compiler is needed for a different target 
machine. The tree structure of Figure 18.3 is clearly machine 
independent, and code generation is clearly machine-dependent. 
What of code optimization? 


According to McClure [1972], the two most effective means of 
code optimization are common subexpression removal (from ad- 
dress calculations) and register allocation. An example of 
the first is the removal of the common subscript calculation 
in: 


A(I,J) = A(I,J) + 1 


Removal of common subexpressions is machine independent and 
can be effected by transformations applied to the tree struc- 
ture. On the other hand, register allocation is clearly 
machine dependent and must be done at some later stage. 


It is very common to have some intermediate
machine-independent form between the tree structure and the
resulting code. This is to push the machine independence as
far as possible. Hence the intermediate form is a kind of
least common multiple of all machine languages. The original
macro implementation of SNOBOL4 was actually written in such a
language. The most extensive (or perhaps intensive would be a
better word) of this kind known to the author is being
developed by Robert Dewar (Ill. Inst. of Tech., Chi., Ill.) in
connection with a machine-independent implementation of
SPITBOL. Dewar's motivation is to produce a macro language
which will lose little to efficiency when expanded on a given
machine.


One of the more common intermediate forms is the four-tuple.
Four-tuples consist of an operation followed by two operands
followed by a destination all separated from each other by a
convenient break character such as a comma. For example:

     ADD,L1,L2,L3

would mean add the contents of L1 and L2 and store the result
into L3. We will assume that the locations can be indexed by
other locations. For example:

     MUL,A(TEMP2),TEMP3,TEMP4

would reference as the first argument the location A offset by
the current value of TEMP2. This could be rendered in machine
M code as:

     LOAD    1,TEMP2
     LOAD    A(1)
     MUL     TEMP3
     STORE   TEMP4

An optimized version of this code may not actually contain the
initial LOAD or the STORE. This will depend on the origin of
TEMP2 and the destination of TEMP4.
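
A naive, unoptimized expansion of four-tuples into machine M
assembly language can be sketched as below. The Python names
expand and operand are illustrative. The sketch assumes that
register 0 serves as the accumulator and that registers 1, 2 and
3 are free to hold the index values of the first operand, second
operand and destination respectively; a real code generator would
allocate registers rather than fix them.

```python
import re

# An indexed location has the form NAME(INDEX), e.g. A(TEMP2).
INDEXED = re.compile(r"^(\w+)\((\w+)\)$")


def operand(text, reg, out):
    """Return an assembler operand for text.

    If text is indexed, first emit a load of the index value into
    register reg and refer to that register in the operand.
    """
    m = INDEXED.match(text)
    if not m:
        return text
    out.append(" LOAD %d,%s" % (reg, m.group(2)))
    return "%s(%d)" % (m.group(1), reg)


def expand(four_tuple):
    """Expand 'OP,X,Y,Z' into LOAD X / OP Y / STORE Z."""
    op, first, second, dest = four_tuple.replace(" ", "").split(",")
    out = []
    source = operand(first, 1, out)
    out.append(" LOAD " + source)
    source = operand(second, 2, out)
    out.append(" %s %s" % (op, source))
    target = operand(dest, 3, out)
    out.append(" STORE " + target)
    return out
```

Every expansion here begins with a LOAD and ends with a STORE;
removing those when the value is already in the accumulator, or is
consumed immediately, is the job of the later optimization phase.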


Hence we may decompose a large processor into the following
phases (as opposed to passes since several phases may actually
go on in the same pass).

     1.  Pre-processing
     2.  Syntactic analysis
     3.  Tree transformations and global optimization
     4.  Intermediate language production
     5.  Final expansion and detailed optimization


Program 18.3 (BLANKS)

The function BLANKS is an example of pre-processing that may
be required when compiling a full language. BLANKS(S) will
remove blanks from a Fortran statement provided as argument.
We assume a function such as FORTREAD (Prog. 9.2) is available
to read in a statement and handle continuation. Removing
blanks sounds simple but is complicated by the fact that blanks
within string literals may not be removed. A string literal in
ANSI Fortran has the form

     nH<n-characters>       (e.g. 3HCAT)

String literals may only appear in FORMAT and CALL statements.
But we cannot simply go looking for this pattern in such
statements because the indicated pattern may appear as part of
an identifier (which may also be an argument of a subroutine
call). For example:

     CALL ALPHA (A 1H)

contains no literal. Hence we must ignore such sequences which
follow alphabetics. Another problem is that blanks may be
interspersed in and around the length indicator. For example:

     1 2 HABCDEFGHIJKL

is a valid literal. This makes it difficult (but, as we will
see, not impossible) to write a single pattern to match a
literal.
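
The two rules just stated (a length indicator that follows an
alphanumeric is part of an identifier, and blanks may appear
within the length indicator) can be sketched in Python. This is a
simplified illustration, not the BLANKS program: it handles only
nH literals, ignores the statement type, and leaves the literal
in nH form. The names HOLLERITH and remove_blanks are
illustrative.

```python
import re

# Length indicator: a digit, then digits or blanks, then the letter H.
HOLLERITH = re.compile(r"(\d[\d ]*)H")


def remove_blanks(stmt):
    """Remove blanks from stmt except inside nH literals."""
    out = []
    prev_alnum = False          # was the last kept character alphanumeric?
    i = 0
    while i < len(stmt):
        ch = stmt[i]
        if ch == " ":
            i += 1
            continue
        m = HOLLERITH.match(stmt, i) if ch.isdigit() else None
        if m and not prev_alnum:
            # A genuine literal: keep the next n characters verbatim.
            n = int(m.group(1).replace(" ", ""))
            out.append("%dH%s" % (n, stmt[m.end():m.end() + n]))
            i = m.end() + n
            prev_alnum = False
        else:
            out.append(ch)
            prev_alnum = ch.isalnum()
            i += 1
    return "".join(out)
```

Blanks do not reset the prev_alnum flag, so in 'A 1H' the digit is
still treated as continuing the identifier A.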


If we depart from the relatively rarified air of the ANSI 
standard and enter the domain of a practical compiler, we 
encounter more problems. IBM's OS/360 Fortran [IBM 360j] is 
typical of many Fortrans and so we will assume this to be our 
source language. With respect to blank removal, this Fortran 
has the following additional properties: 


(1) A literal may be designated by the sequence '...' as 
well as by the nH<n-character> sequence. 


(2) Function calls (as well as subroutine calls) may  con- 
tain literals. 


(3) The READ and WRITE statements may be direct access in 
which case they have the form: 


cmnd(f ' exp ... 


where cmnd is READ or WRITE, where f is an integer or 
an identifier designating a file and where exp is an 
arbitrary expression designating a record number. 


Now (2) implies that all arithmetic expressions (including the 
exp portion of (3)) can potentially contain literals. 
Therefore READ and WRITE statements must be handled specially. 
A logical IF statement has the form: 


IF( exp ) stmt 


Here we must check to see if stmt is a READ or WRITE statement 
but our check is complicated by the fact that in order to find 
stmt we must determine where exp ends. To do this we must 
maintain a parenthesis count ignoring parentheses that are 
within literals. This can be done by recursion in a manner 
reminiscent of the BAL function (Prog. 8.3). 
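
The parenthesis count can be sketched directly. The Python
function below is an illustration only: it uses an explicit
counter instead of the recursion mentioned above, recognizes only
quoted literals (not the nH form), and assumes blanks outside
literals have already been removed. The name split_logical_if is
illustrative.

```python
def split_logical_if(stmt):
    """Split 'IF(exp)stmt' into (exp, stmt); return None otherwise."""
    if not stmt.startswith("IF("):
        return None
    depth, i = 0, 2
    while i < len(stmt):
        ch = stmt[i]
        if ch == "'":
            # Skip a quoted literal so its parentheses are not counted.
            j = stmt.find("'", i + 1)
            if j < 0:
                return None
            i = j
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return stmt[3:i], stmt[i + 1:]
        i += 1
    return None
```

Once the statement part has been isolated in this way it can be
tested for a leading READ or WRITE.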


We might say a word at this point as to why we wish to go
through so much trouble to remove blanks. For one thing, the
blank removal process can be used not only for compiling but
for many other kinds of pre-processing, data laundry, etc.
that require pattern matching of Fortran programs. Hence it
saves duplication of effort if it can be done once and for
all. Another reason is that keywords, identifiers and many
other non-decomposable units can have blanks interspersed
within them (however improbable that may be) which will prove
difficult to pattern match. For example, the keyword READ may
be written as 'R EA D'; to match this we may write:

     OPTB  =  SPAN(' ') | NULL
     READ  =  'R' OPTB 'E' OPTB 'A' OPTB 'D'

but this is as troublesome as it is inefficient.


*  BLANKS(S) will return the result of removing blanks from a
*  Fortran statement provided in S. BLANKS(S) will operate
*  correctly for OS/360 Fortran [IBM 360g]. The statement is
*  presumed to have had its label removed by previous
*  processing.
*
         DEFINE('BLANKS(S)IF,KW,STMT,IO')
         Q  =  "'"
         ALPHA  =  'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
         NUM  =  '0123456789'
*
*  FBAL will match a string balanced with respect to
*  parentheses but will ignore parentheses within literals. We
*  will use backup-free scanning (i.e. the ARBNO(P FENCE)
*  construct) as established in Chapter 6.
*
         BLINT  =  ANY(NUM) (SPAN(NUM ' ') | NULL)
         F.LIT  =  BLINT $ N  'H'  LEN(*DIFF(N,' ')) . LIT
+                  |  Q  BREAK(Q) . LIT  Q
         ITEM1  =  F.LIT  |  SPAN(' ')  |  SPAN(ALPHA NUM ' ')
+                  |  LEN(1)
         SEARCH.LIT  =  POS(0)  ARBNO(ITEM1 FENCE) . TEMP  F.LIT
         ITEM2  =  '(' *FBAL ')'  |  ITEM1
         FBAL  =  ARBNO(ITEM2 FENCE)
*
*  The function BL(S) will remove all blanks from S except
*  those in literals.
*
         DEFINE('BL(S)LIT,TEMP')                     :(BL_END)
BL       S  SEARCH.LIT  =                            :F(BL_1)
         BL  =  BL DIFF(TEMP,' ') "'" LIT "'"        :(BL)
BL_1     BL  =  BL DIFF(S,' ')                       :(RETURN)
BL_END
*
*  Define some patterns to scan statements containing
*  critical keywords.
*
         KWORD.KW  =  POS(0)  SPAN(ALPHA ' (') . KW
         IF.STMT  =  POS(0)  ('IF(' FBAL ')') . IF  REM . STMT
         IO.STMT  =  POS(0)  (('READ' | 'WRITE') '('
+          BREAK(ALPHA NUM) SPAN(ALPHA NUM ' ')) . IO  Q  REM . STMT
+                                                    :(BLANKS_END)
*
*  Entry point for BLANKS(S); first remove blanks from the
*  keyword to test statement type.
*
BLANKS   S  KWORD.KW  =  DIFF(KW,' ')
         BLANKS  =  S
         BLANKS  IF.STMT  =  BL(IF) BLANKS(STMT)     :S(RETURN)
         BLANKS  IO.STMT  =  DIFF(IO,' ') "'" BL(STMT) :S(RETURN)
         BLANKS  =  BL(S)                            :(RETURN)
BLANKS_END
Names referenced     Name     Type       Where defined
by BLANKS:           DIFF     Function   Program 3.10


Program 18.4 (POL)

The method of invoking semantic routines used in the coding of
L_ONE is general enough but not sufficiently convenient for
very large languages of, say, PL/I size. To see this, consider
the tree decomposition of a language statement as shown in
Figure 18.3. By means of S() a function may be called before
and after each node of the tree with the sequence of calls
being made in left-to-right order. Moreover, every leaf of the
tree may be pushed and these pushes are interspersed between
calls also in a left-to-right fashion. We could hardly ask for
anything better, or could we?


The reader will find, if he does the exercises involving
extensions to L_ONE, that he will be forced to push and pop
many different items in order to preserve quantities from the
start of a syntactic unit across to its termination. For
example, to produce code for IF<E>THEN<S> we must create a
conditional branch across the THEN-clause. For this we will
need to create a label which will be used in two places, before
and after the <S>. Since <S> may be arbitrary, including
another IF<E>THEN<S> sequence, the label cannot be assigned to
a variable but must be pushed and popped. Now if the functional
relationship followed the structural relationship we would
regard IFTHEN as a single node of a tree with two arguments <E>
and <S>. The IFTHEN function would call the functions for <E>
and <S> to obtain translations. This will prove to be more
natural. The temporary-variable facility built into the
function mechanism can be used instead of stacks and a somewhat
cleaner implementation results. In order to achieve a
functional relationship conforming to the structural
relationship the source string is converted into a tree form;
our tree will be Polish prefix.


To obtain a slightly richer language to illustrate the
conversion process, we define an upward compatible superset of
L1, called L2. This is defined in Figure 18.4. Unlike L1, we
must allow blanks as separators (not shown in the BNF) but we
do not permit blanks within identifiers and numbers. This is
much like the PL/I convention whereas L1 followed the Fortran
convention.


The form of Polish prefix for any non-leaf (a node containing 
at least one descendent) is: 


operator:n,operand1,operand2,...,operandn


where each operand is itself a valid tree. The operator may 
not contain either of the two special characters colon or com- 
ma. For a leaf, the :n is absent and, of course, there are no 
operands. Thus: 


A + B * C    becomes    +:2,A,*:2,B,C
and
A * (-B)     becomes    *:2,A,-:1,B


<ELIST>    ::= <E>,<ELIST> | <E>
<REF>      ::= <IDEN>(<ELIST>)
<PRIMARY>  ::= <IDEN> | <INTEGER> | (<E>) | <REF>
<FACTOR>   ::= <PRIMARY> | -<PRIMARY>
<TERM>     ::= <TERM>*<FACTOR> | <TERM>/<FACTOR> | <FACTOR>
<E>        ::= <E>+<TERM> | <E>-<TERM> | <TERM>
<RELOP>    is one of '>' '<' '=' '>=' '<=' '¬='
<BOOL>     ::= <E><RELOP><E>
<IFSTMT>   ::= IF<BOOL>THEN<STMT>ELSE<STMT> | IF<BOOL>THEN<STMT>
<VAR>      ::= <IDEN> | <REF>
<ASGNSTMT> ::= <VAR>=<E>
<STMT>     ::= <IFSTMT> | <ASGNSTMT>


Figure 18.4

The language L2. The definitions for <IDEN> and <INTEGER> are
the same as for L1 (Figure 18.2).


This seems ugly but it will be easy to produce, scan and 
expand. 
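
The prefix form described above can be sketched outside SNOBOL4.
The following is a minimal Python illustration, not part of the
book's programs: the function name to_prefix and the use of
nested tuples to stand for a tree are assumptions made only for
this example.

```python
def to_prefix(node):
    """Render a tree in the operator:n,operand1,...,operandn form.

    A leaf is a plain string; a non-leaf is a tuple whose first item
    is the operator and whose remaining items are operand trees.
    (This tuple encoding is an assumption of the sketch.)
    """
    if isinstance(node, str):
        # A leaf carries no ':n' and has no operands.
        return node
    op, *operands = node
    pieces = [f"{op}:{len(operands)}"] + [to_prefix(x) for x in operands]
    return ",".join(pieces)


# The two examples given in the text.
print(to_prefix(("+", "A", ("*", "B", "C"))))   # +:2,A,*:2,B,C
print(to_prefix(("*", "A", ("-", "B"))))        # *:2,A,-:1,B
```

Because every non-leaf records its own operand count, the string
can be scanned from left to right without parentheses.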


A functional form such as A(B,C,D) will translate into: 
REF:2,A,COMMA:2,B,COMMA:2,C,D


No distinction is made, at least initially, between an array 
and a function since declarations may follow first use. Note 
that the argument list is a sequence of 2-ary functions rather 
than a single n-ary. This form is easier to produce and just
as easy to scan. 


To transform infix to prefix, we will use the conditional
invocation of semantic routines as in L_ONE. Only two routines
need be defined. CPUSH(STR) will conditionally push the string
STR onto the stack (conditional upon the pattern being a part
of an overall successful match). CPUSH(STR) actually returns:


NULL . *S_('CPUSH',STR)


where S_() is now written expecting an extra argument. The
other routine is POL(N) which causes N+1 items on the stack to
be popped and replaced by one larger item, viz.


OP:N,ARG1,ARG2,...,ARGn


The operator is assumed to be the second last item on the
stack. N is at least 1.
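
The stack discipline of POL(N) can be sketched as follows. This
is a Python illustration under stated assumptions, not the book's
SNOBOL4 routine: the stack is an ordinary Python list and the
name pol is hypothetical. The sequence of pushes shown is the one
that the patterns would make for A(B,C,D).

```python
def pol(stack, n):
    """Replace the top n+1 stack items by one 'OP:n,ARG1,...,ARGn' item.

    The operator is the second last item; the last operand lies above
    it and the earlier operands lie below it.
    """
    last = stack.pop()                    # operand n (top of the stack)
    op = stack.pop()                      # the operator sits just beneath
    earlier = [stack.pop() for _ in range(n - 1)][::-1]   # operands 1..n-1
    stack.append(",".join([f"{op}:{n}"] + earlier + [last]))


# Pushes made while matching A(B,C,D): the identifier, then 'REF',
# then each argument with a 'COMMA' pushed before every later one.
stack = ["A", "REF", "B", "COMMA", "C", "COMMA", "D"]
pol(stack, 2)      # innermost COMMA
pol(stack, 2)      # outer COMMA
pol(stack, 2)      # REF
print(stack)       # ['REF:2,A,COMMA:2,B,COMMA:2,C,D']
```

Placing the operator second from the top is what lets a
left-recursive pattern push the operator as soon as it is seen
and call POL only after the final operand has been matched.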


Once the machinery of POL(N) and CPUSH(STR) has been set up,
very large languages can be compiled with no additional
semantic routines except error messages and routines to handle
declarations. These we ignore for simplicity. We will
illustrate the method by writing a pattern which will
transform sentences of L2 into Polish prefix.


+------------------------------------------------------------+
| This program illustrates how to convert L2 into Polish     |
| prefix using special semantic routines, viz. POL(N) and    |
| CPUSH(S) for the purpose. We first define the semantic     |
| routines.                                                  |
+------------------------------------------------------------+

         DEXP("POL(N) = S('POL',N)")
         DEXP("CPUSH(ARG) = S('CPUSH',ARG)")
         DEFINE('S(NAME,ARG)')
         DEFINE('S_(NAME,ARG)T1,T2')                      :(S_END)
S        S  =  EVAL("NULL . *S_('" NAME "','" ARG "')")   :(RETURN)
S_       S_  =  .DUMMY                                    :($('S_' NAME))
S_POL    T2  =  POP()
         T1  =  POP() ':' ARG ','
S_POL_1  (EQ(ARG,1) PUSH(T1 T2))                          :S(NRETURN)
         ARG  =  ARG - 1
         T2  =  POP() ',' T2                              :(S_POL_1)
S_CPUSH  PUSH(ARG)                                        :(NRETURN)
S_END
+------------------------------------------------------------+
| We now write our patterns. Interspersed blanks are handled |
| by placing an optional blank pattern at the end of each    |
| pattern primitive. Patterns formed from other patterns     |
| then need not worry about blanks.                          |
+------------------------------------------------------------+

         AL  =  'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
         NU  =  '0123456789'
         BL  =  SPAN(' ') | NULL
         IDEN  =  (ANY(AL) (SPAN(AL NU) | '')) . *PUSH() BL
         INTEGER  =  SPAN('0123456789') . *PUSH() BL
         ADDOP  =  ANY('+-') . *PUSH() BL
         MULOP  =  ANY('*/') . *PUSH() BL
         RELOP  =  (ANY('=<>') | ANY('¬><') '=') . *PUSH() BL
         LP  =  '(' BL
         RP  =  ')' BL
         ELIST  =  *E (',' BL CPUSH('COMMA') *ELIST POL(2) | '')
         REF  =  IDEN LP CPUSH('REF') ELIST RP POL(2)
         PRIMARY  =  IDEN | INTEGER | LP *E RP | REF
         FACTOR  =  PRIMARY | '-' . *PUSH() BL PRIMARY POL(1)
         TERM  =  *TERM MULOP FACTOR POL(2) | FACTOR
         E  =  *E ADDOP TERM POL(2) | TERM
         BOOL  =  *E RELOP *E POL(2)
         IFSTMT  =  'IF' BL BOOL 'THEN' BL
+           (*STMT 'ELSE' BL CPUSH('IFELSE') *STMT POL(3) |
+           CPUSH('IFTHEN') *STMT POL(2))
         ASGNSTMT  =  (IDEN | REF) '=' . *PUSH() BL *E POL(2)
         STMT  =  IFSTMT | ASGNSTMT


Names referenced   Name    Type       Where defined

by POL:            DEXP *  Function   Program 14.1
                   PUSH    Function   Program 5.5
                   POP     Function   Program 5.6


* indicates name is referenced in the initialization section.


Epilogue

For example, if we execute:


         'IF A(I) > 6 THEN I = 2'  STMT
         OUTPUT  =  POP()


we will print:


IFTHEN:2,>:2,REF:2,A,I,6,=:2,I,2


Program 18.5 - TREE

With a statement cast as Polish prefix we may enter the
optional tree-adjustment phase in which the tree is scanned
looking for patterns which may be pruned, modified or
rearranged. There are several reasons for doing this, some of
which are listed below:


1. To insert explicit conversions (for mixed mode arithmetic,
   array references, etc.).


2. To remove ambiguities (such as floating versus integer
   addition, binary versus unary minus, function references
   versus array references).


3. Code optimization such as common subexpression removal or
   such as replacing <VAR> = <VAR> + 1 by a single operator.


Other uses for the tree adjustment phase will occur to the
writer of a practical compiler. An important point to note is
that the scan is generally easier to apply to the tree than to
any other form because it is quite easy to specify a pattern
to match a tree. The following function, TREE(P,N), will
return a pattern that will do precisely that. For example,


TREE('+',2) $ OUTPUT FAIL


is a pattern that will scan for and print all binary sums in
Polish prefix form.

+------------------------------------------------------------+
| TREE(P,N) will match a tree in Polish prefix form whose    |
| node value matches the pattern P and where N is the number |
| of branches. The tree is assumed to be a non-leaf. If N    |
| is 0, then an arbitrary number of nodes (up to some        |
| maximum) is implied.                                       |
+------------------------------------------------------------+

         DEFINE('TREE(P,N)')
         ARB_TREE  =  TREE(BREAK(':,')) | BREAK(':,') ','
+                                                   :(TREE_END)
TREE     TREE  =  EQ(N,0) P
+           (TREE(,1) | TREE(,2) | TREE(,3) | TREE(,4))
+                                                   :S(RETURN)
         TREE  =  P ':' N ','
TREE_1   N  =  N - 1  GT(N,0)                       :F(RETURN)
         TREE  =  TREE *ARB_TREE                    :(TREE_1)
TREE_END

Epilogue 


The alert reader will note that the pattern requires a ter- 
minating  ','. Thus, to use TREE on the Polish notation 
described above would require appending a comma to the total 
string. It may also be necessary to prepend a comma. For ex- 
ample, ARB_TREE is a variable which was set as a side-effect 
of initializing TREE to equal a pattern which will match an 
arbitrary tree. Then: 


POLISH = ',' POLISH ',' 
POLISH ',' ARB_TREE $ T ARB *T


will scan the Polish for a pair of identical expressions. (For 
this pattern match to work it will be necessary to use 
FULLSCAN mode; in QUICKSCAN mode, ARB indicates futility as 
was discussed in Chapter 7). Several examples of the use of 
TREE have been left as exercises. 
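
The search for a pair of identical expressions can also be
sketched without pattern matching. The following Python is an
illustration only: the function name subtrees and the sample
string are assumptions, and unlike ARB_TREE this sketch collects
non-leaf subtrees only.

```python
from collections import Counter


def subtrees(tokens, i=0, out=None):
    """Walk the tree that starts at tokens[i].

    Returns (end, out), where end is the index just past the tree and
    out lists the text of every non-leaf subtree encountered.
    """
    if out is None:
        out = []
    head = tokens[i]
    j = i + 1
    if ":" in head:                        # a non-leaf looks like operator:n
        n = int(head.rsplit(":", 1)[1])
        for _ in range(n):
            j, _unused = subtrees(tokens, j, out)
        out.append(",".join(tokens[i:j]))
    return j, out


# Polish form of  X = A*B + A*B  (a made-up example).
polish = "=:2,X,+:2,*:2,A,B,*:2,A,B"
_, found = subtrees(polish.split(","))
repeated = [t for t, count in Counter(found).items() if count > 1]
print(repeated)      # ['*:2,A,B']
```

A subtree that occurs more than once is a candidate for common
subexpression removal, the third use of the tree-adjustment phase
listed earlier.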


Program 18.6 - TR

Given a statement in Polish prefix, we can generally produce
compiled code by recursive invocation of a single translate
function. We will not produce code directly but will create
four-tuples as described previously. The set of acceptable
4-tuples is indicated in Figure 18.5.


Certain semantic ambiguities in the description of L2 need be
resolved before TR can be written. Floating point as well as
integer arithmetic will be permitted. We assume that
identifiers beginning with ANY('IJKLMN') are integer; all
others are real (floating point). Mixed-mode arithmetic is not
permitted. The functional forms specified in the syntax of L2
refer to array references; function calls are not permitted
(but are left as an exercise). Finally, for simplicity, array
references are assumed to be one-dimensional. The extension to
multi-dimensioned arrays is relatively straightforward given
the standard multiplier technique [Gries 1971, Sect. 8.0] but
is beyond the scope of the present discussion.


4-tuple                 Description

ADD,arg1,arg2,arg3      Place arg1 plus arg2 into arg3.
                        Seven similar operations for SUB, MUL,
                        DIV, FADD, FSUB, FMUL and FDIV.
ASGN,arg1,,arg3         Move the quantity from arg1 to arg3.
MNS,arg1,,arg3          Store -arg1 into arg3.
BR,,,arg3               Branch to arg3.
BRGT,arg1,arg2,arg3     Branch to arg3 if arg1 is greater than
                        arg2. Five similar operations for BRGE,
                        BREQ, BRNE, BRLT and BRLE.
LBL,arg1                Insert a label here.

argn is of the form ID or ID(ID) where ID is an identifier.

If identifiers are of the form TEMPn they are considered
volatile; i.e., they may be destroyed after first use.


Figure 18.5

The tuple language.
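
The recursive translation into 4-tuples can be sketched in
Python as follows. This is a simplified illustration and not the
TR function itself: the names translate, walk and is_integer are
assumptions, and only arithmetic operators, unary minus and
assignment between simple identifiers are handled (no array
references, relations or IF statements).

```python
import itertools

OPS = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}


def translate(polish):
    """Translate a Polish prefix string into a list of 4-tuple strings."""
    tokens = polish.split(",")
    temps = itertools.count(1)
    tuples = []

    def is_integer(i):
        # Follow first operands down to a leaf; a name beginning with
        # one of I..N is taken to be integer, as the text assumes.
        while ":" in tokens[i]:
            i += 1
        return tokens[i][0] in "IJKLMN"

    def walk(i):
        """Translate the tree at tokens[i]; return (next index, result)."""
        head = tokens[i]
        if ":" not in head:                       # a leaf is its own result
            return i + 1, head
        op, n = head.rsplit(":", 1)
        if op == "=":
            j, target = walk(i + 1)
            j, value = walk(j)
            tuples.append(f"ASGN,{value},,{target}")
            return j, target
        if n == "1":                              # unary minus
            j, a = walk(i + 1)
            t = f"TEMP{next(temps)}"
            tuples.append(f"MNS,{a},,{t}")
            return j, t
        name = OPS[op] if is_integer(i) else "F" + OPS[op]
        j, a = walk(i + 1)
        j, b = walk(j)
        t = f"TEMP{next(temps)}"
        tuples.append(f"{name},{a},{b},{t}")
        return j, t

    walk(0)
    return tuples


print(translate("=:2,X,+:2,X,1"))   # ['FADD,X,1,TEMP1', 'ASGN,TEMP1,,X']
```

Each call returns the name under which its result may be found,
so an enclosing operator simply uses that name as an operand.
The local variables of each call take the place of an explicit
stack.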


+------------------------------------------------------------+
| TR() will return a translation of a Polish string          |
| contained in the global variable POLISH which is modified  |
| (and reduced to null) in the process. A trailing comma is  |
| appended to the Polish string to permit easier pattern     |
| matching. The translation is in the form of 4-tuples       |
| separated by '//'. The language is L2.                     |
+------------------------------------------------------------+


         DEFINE('TR(ARG)OP,N,P,T,ID,L1,L2')


+------------------------------------------------------------+
| Pattern definitions: ITREE will match an integer tree.     |
| RTREE will match a real tree.                              |
+------------------------------------------------------------+


         ITREE  =  ANY('+-*/') ':' ANY('12') ',' *ITREE |
+           ANY('IJKLMN') BREAK(',:') ',' | 'REF:2,' *ITREE
         RTREE  =  ANY('+-*/') ':' ANY('12') ',' *RTREE |
+           NOTANY('IJKLMN') BREAK(',:') ',' | 'REF:2,' *RTREE
+                                                   :(TR_END)


+------------------------------------------------------------+
| Entry point: if an operator, fan out; otherwise push the   |
| leaf.                                                      |
+------------------------------------------------------------+


TR       POLISH  POS(0) BREAK(':,') . OP ':' BREAK(',') . N
+           ','  =                                  :S($('TR_' OP))
         POLISH  BREAK(',') . *PUSH() ','  =        :(RETURN)


+------------------------------------------------------------+
| Arithmetic operators.                                      |
+------------------------------------------------------------+
TR_+ ;TR_- ;TR_* ;TR_/
         TR  =  EQ(N,1) TR() 'MNS,' POP() ',,' PUSH(TEMP()) '//'
+                                                   :S(RETURN)
         '+ADD-SUB*MUL/DIV'  OP LEN(3) . OP
         POLISH  POS(0) ITREE                       :S(TR_1)
         OP  =  'F' OP
TR_1     T  =  TR()
         P  =  POP()
         TR  =  T TR() OP ',' P ',' POP() ',' PUSH(TEMP()) '//'
+                                                   :(RETURN)


+------------------------------------------------------------+
| Array references                                           |
+------------------------------------------------------------+
TR_REF   POLISH  BREAK(',') . ID ','  =
         TR  =  TR()
         TOP()  '('                                 :S(TR_REF1)
         PUSH(ID '(' POP() ')')                     :(RETURN)
TR_REF1  T  =  TEMP()
         TR  =  TR 'ASGN,' POP() ',,' T '//'
         PUSH(ID '(' T ')')                         :(RETURN)

+------------------------------------------------------------+
| Relations are handled here. Note that '=' has been         |
| translated by the TR_IF... processor to 'EQ' to avoid      |
| ambiguity with assignment. An argument, ARG, contains the  |
| fail label. Success implies a no-op. Hence we need the     |
| complement of the given operation.                         |
+------------------------------------------------------------+
TR_> ;TR_>= ;TR_< ;TR_<= ;TR_¬= ;TR_EQ
         'EQNE ¬=EQ <GE >LE <=GT >=LT'  OP LEN(2) . OP
         T  =  TR()
         P  =  POP()
         TR  =  T TR() 'BR' OP ',' P ',' POP() ',' ARG '//'
+                                                   :(RETURN)

+------------------------------------------------------------+
| Assignment                                                 |
+------------------------------------------------------------+
TR_=     TR  =  TR() TR() 'ASGN,' POP() ',,' POP() '//'
+                                                   :(RETURN)


+------------------------------------------------------------+
| The IF's                                                   |
+------------------------------------------------------------+


TR_IFTHEN
TR_IFELSE  L1  =  LABEL()
         POLISH  POS(0) '=:2'  =  'EQ:2'
         TR  =  TR(L1) TR()
         TR  =  EQ(N,2) TR 'LBL,' L1 '//'           :S(RETURN)
         L2  =  LABEL()
         TR  =  TR 'BR,,,' L2 '//'
+           'LBL,' L1 '//' TR() 'LBL,' L2 '//'      :(RETURN)
TR_END


+------------------------------------------------------------+
| LABEL() is like TEMP().                                    |
+------------------------------------------------------------+


         DEFINE('LABEL()')                          :(LABEL_END)
LABEL    LABEL_NO  =  LABEL_NO + 1
         LABEL  =  'LBL.' LABEL_NO                  :(RETURN)
LABEL_END


Names referenced   Name    Type          Where defined

by TR:             PUSH    Function      Program 5.5
                   POP     Function      Program 5.6
                   TOP     Function      Program 5.7
                   TEMP    Subfunction   Program 18.2
Program 18.7 - TUPLE

TUPLE(OP,ARG1,ARG2,ARG3) will expand a 4-tuple (as described
in Figure 18.5) into reasonably optimized machine code. It
does this by being 'aware' at all times of the state of the
registers and allocates and frees registers according to a
primitive priority scheme. For example, the tuples produced
(by POL and TR) for the two statements:


X = X + 1
IF X > Y THEN X = X + A(I+1) + Z


are shown in Figure 18.6 together with the instructions
generated by TUPLE. Note that spurious LOAD's and STORE's
which were present in L_ONE are gone. TUPLE assumes that any
temporary variable (of the form TEMPn) is only referenced once
and is not used across statement boundaries.


| FADD,X,1,TEMP1          |  LOAD  1,X       |
|                         |  FADD  1,=1      |
|                         |                  |
| ASGN,TEMP1,,X           |  STORE 1,X       |
|                         |                  |
| BRLE,X,Y,LBL.1          |  SUB   1,Y       |
|                         |  BRLE  1,LBL.1   |
|                         |                  |
| ADD,I,1,TEMP2           |  LOAD  1,I       |
|                         |  ADD   1,=1      |
|                         |                  |
| FADD,X,A(TEMP2),TEMP3   |  LOAD  2,X       |
|                         |  FADD  2,A(1)    |
|                         |                  |
| FADD,TEMP3,Z,TEMP4      |  FADD  2,Z       |
|                         |                  |
| ASGN,TEMP4,,X           |  STORE 2,X       |
|                         |                  |
| LBL,LBL.1               |  LBL.1           |


Figure 18.6

The tuples produced by TR (on the left) and the corresponding
code generated by TUPLE (on the right) for the statement
sequence: X = X + 1 ; IF X > Y THEN X = X + A(I+1) + Z.


The register allocation schemes used in actual compilers seem
to be 'always messy'. TUPLE was written in a highly structured
top-down fashion to avoid this. Note that the higher level
routines have no notion at all of what the data structure to
associate registers with locations looks like. Only low-level
caretaker routines know this. This is an example of
'information hiding' as advocated by Parnas [1972].


         DEFINE('TUPLE(OP,ARG1,ARG2,ARG3)R')        :(TUPLE_END)


TUPLE                                               :($('TU_' OP))
TU_ADD ;TU_FADD ;TU_SUB ;TU_FSUB
TU_MUL ;TU_FMUL ;TU_DIV ;TU_FDIV
         R  =  LOAD(ARG1)
         OUTPUT  =  ' ' OP ' ' R ',' ADDR(ARG2)
         DEASSOC(R)
         STORE(R,ARG3)                              :(RETURN)
TU_ASGN  R  =  LOAD(ARG1)
         STORE(R,ARG3)                              :(RETURN)
TU_MNS   R  =  REG()
         OUTPUT  =  ' LOADN ' R ',' ADDR(ARG1)
         STORE(R,ARG3)                              :(RETURN)
TU_BR    ARG3  =  INDEX(ARG3)
         OUTPUT  =  ' BR ' ARG3                     :(RETURN)
TU_BRGT ;TU_BRGE ;TU_BRLT ;TU_BRLE ;TU_BREQ ;TU_BRNE
         R  =  LOAD(ARG1)
         OUTPUT  =  ' SUB ' R ',' ADDR(ARG2)
         FREE(R)
         OUTPUT  =  ' ' OP ' ' R ',' ARG3           :(RETURN)
TU_LBL   OUTPUT  =  ARG1
         REG_LIST  =  ','                           :(RETURN)
TUPLE_END


+------------------------------------------------------------+
| LOAD(LOC) will load the indicated location (if not already |
| loaded) into a register and return the register.           |
+------------------------------------------------------------+


         DEFINE('LOAD(LOC)')                        :(LOAD_END)
LOAD     LOAD  =  ISREG(LOC)                        :S(RETURN)
         LOC  =  ADDR(LOC)
         LOAD  =  REG()
         ASSOC(LOC,LOAD)
         OUTPUT  =  ' LOAD ' LOAD ',' LOC           :(RETURN)
LOAD_END


+------------------------------------------------------------+
| STORE(REG,LOC) is a generalized store operation storing a  |
| given register REG into a given location LOC updating the  |
| register assignment list.                                  |
+------------------------------------------------------------+

         DEFINE('STORE(REG,LOC)')                   :(STORE_END)
STORE    LOC  =  INDEX(LOC)
         FREE(REG)
         ASSOC(LOC,REG)
         LOC  TEMP_LOC                              :S(RETURN)
         OUTPUT  =  ' STORE ' REG ',' LOC           :(RETURN)
STORE_END


+------------------------------------------------------------+
| ADDR(LOC) will return a usable address designating the     |
| possibly subscripted location LOC. The address returned    |
| will be a register number if LOC is contained in a         |
| register. If LOC is subscripted, a register number         |
| replaces the subscript. If LOC is a constant, the symbol   |
| '=' is prepended.                                          |
+------------------------------------------------------------+


         DEFINE('ADDR(LOC)')                        :(ADDR_END)
ADDR     ADDR  =  LOC
         ADDR  =  INDEX(ADDR)
         ADDR  =  ISREG(ADDR)                       :S(RETURN)
         ADDR  POS(0) SPAN('0123456789') RPOS(0)  =
+           '=' ADDR                                :(RETURN)
ADDR_END


+------------------------------------------------------------+
| INDEX(LOC) will load the subscript (if any) of the given   |
| location into a register and return the same expression    |
| with the index replaced by a constant.                     |
+------------------------------------------------------------+


         DEFINE('INDEX(LOC)S')                      :(INDEX_END)
INDEX    INDEX  =  LOC
         INDEX  '(' BREAK(')') . S  =  '(' LOAD(S)  :(RETURN)
INDEX_END
+------------------------------------------------------------+
| The following five functions are low-level basic routines  |
| used to associate registers with locations. A string of    |
| register-location pairs is kept in the order of increasing |
| priority in REG_LIST. If a register is associated with a   |
| location then the value normally found at that location    |
| will be in the register. Also, if the location is a        |
| temporary, the location will not contain that value;       |
| otherwise the location will also contain the value.        |
+------------------------------------------------------------+


         DEFINE('REG()LOC')
         DEFINE('FREE(REG)')
         DEFINE('ISREG(LOC)')
         DEFINE('ASSOC(LOC,REG)')
         DEFINE('DEASSOC(REG)')


         NO_REGS  =  16
         REG_LIST  =  ','
         TEMP_LOC  =  POS(0) 'TEMP' SPAN('0123456789') RPOS(0)
+                                                   :(REG_END)


+------------------------------------------------------------+
| REG() will return an available register. If all registers  |
| are associated with locations, it will free up the         |
| register with the lowest priority.                         |
+------------------------------------------------------------+


REG      REG  =  LT(REG,NO_REGS) REG + 1            :F(REG_1)
         REG_LIST  '(' REG ')'                      :F(RETURN)S(REG)
REG_1    REG_LIST  ',' BREAK('(') . LOC '('
+           BREAK(')') . REG ')'  =  ','
         LOC  TEMP_LOC                              :F(RETURN)
         OUTPUT  =  ' STORE ' REG ',' LOC           :(RETURN)


+------------------------------------------------------------+
| FREE(REG) will free a register for other associations.     |
+------------------------------------------------------------+
FREE     REG_LIST  ',' BREAK('(') '(' REG ')'  =    :(RETURN)

+------------------------------------------------------------+
| ISREG(LOC) is a predicate which will determine if LOC is   |
| currently associated with a register. If so it will boost  |
| its priority.                                              |
+------------------------------------------------------------+
ISREG    REG_LIST  ',' LOC '(' BREAK(')') . ISREG ')'  =
+                                                   :F(FRETURN)
         REG_LIST  =  REG_LIST LOC '(' ISREG '),'   :(RETURN)


+------------------------------------------------------------+
| ASSOC(LOC,REG) will associate an unsubscripted location    |
| with a register.                                           |
+------------------------------------------------------------+


ASSOC    LOC  '('                                   :S(RETURN)
         REG_LIST  =  REG_LIST LOC '(' REG '),'     :(RETURN)


+------------------------------------------------------------+
| DEASSOC(REG) will remove any association a register has    |
| with a location but will not free the register.            |
+------------------------------------------------------------+
DEASSOC  REG_LIST  ',' BREAK('(') '(' REG ')'  =
+           ',(' REG ')'                            :(RETURN)
REG_END


Epilogue 


Note that a distinction is made between a register which is 
free and one which is merely disassociated. This distinction 
is necessary because when a register is about to be stored it
is not yet free (for use as an index register for example) and
yet it may be unrelated to any given variable. Note also that
although a register could theoretically be associated with two
different locations (such as after A = B), TUPLE allows only
one such association.
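
The priority scheme behind REG, ISREG and LOAD can be sketched in
Python as follows. This is a simplified illustration under stated
assumptions, not a transcription of TUPLE: the class name
Registers and its fields are hypothetical, the list of pairs
stands in for the REG_LIST string, and subscripts, constants and
disassociated registers are ignored.

```python
import re


class Registers:
    """Location-register pairs kept in order of increasing priority."""

    def __init__(self, count=16):
        self.count = count      # plays the role of NO_REGS
        self.pairs = []         # (location, register), lowest priority first
        self.code = []          # instructions emitted so far

    def isreg(self, loc):
        """Return the register holding loc and boost its priority, else None."""
        for pair in self.pairs:
            if pair[0] == loc:
                self.pairs.remove(pair)
                self.pairs.append(pair)      # move to the high-priority end
                return pair[1]
        return None

    def reg(self):
        """Return an unused register, evicting the lowest priority if needed."""
        used = {r for _, r in self.pairs}
        for r in range(1, self.count + 1):
            if r not in used:
                return r
        loc, r = self.pairs.pop(0)
        # Only a temporary must be saved: any other location already
        # holds the same value as its register.
        if re.fullmatch(r"TEMP\d+", loc):
            self.code.append(f"STORE {r},{loc}")
        return r

    def load(self, loc):
        """Return a register containing loc, loading it only if necessary."""
        r = self.isreg(loc)
        if r is None:
            r = self.reg()
            self.pairs.append((loc, r))
            self.code.append(f"LOAD {r},{loc}")
        return r


regs = Registers(count=2)
for name in ["X", "Y", "X", "Z"]:
    regs.load(name)
print(regs.code)     # ['LOAD 1,X', 'LOAD 2,Y', 'LOAD 2,Z']
```

The second reference to X costs no instruction and raises the
priority of X, so it is Y that gives up its register when Z is
loaded.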


No distinction is made between fixed and floating point 
operands of the relational operators. We are here assuming 
that floating numbers operate on the same equality scale as 
integers (a common case). 


Program 18.8 - GPM

A macro system is basically a method whereby the user of the
system may define and employ abbreviations. GPM stands for
General Purpose Macro processor and was developed by Strachey
[1965]. GPM is general purpose in two ways: it can be employed
as a preprocessor for an arbitrary language and it can produce
arbitrary string computations.


Macros first grew into prominence with the development of as- 
semblers. Initially they were mere abbreviations for instruc- 
tion sequences but soon grew more sophisticated with the 
introduction of arguments, conditional assembly instructions, 
repeat and sequencing facilities. Macros were able to define 
other macros and redefine themselves. McIlroy [1960] describes
many of these techniques. 


It was soon realized that a complete computational facility 
could be implemented relatively easily based on little more 
than the ability to define a macro and GPM was one of the 
first complete languages to be based on a macro system. But 
whereas GPM is complete, as we shall see later, one must al- 
most stand on one's computational head to perform certain com- 
mon operations (e.g., see Exers. 18.25 and 18.27). 


We will write GPM as a function GPM(S) which will return a
translation of string S. If S does not contain either of the
two special characters '#' or '<', it will be returned intact.
A sequence of the form:


#name,arg1,arg2,...,argn;


is considered to be a macro call. Macro calls within the 
string S will be replaced by an evaluation. Every macro call 
returns a string (which is possibly null). This returned 


string is again passed through GPM by a recursive call to ob- 
tain the macro's evaluation. 


The built-in macro DEF allows macros to be defined.


#DEF,name,pr;


will define a macro by the given name and associate it with a
prototype pr. It returns the null string. For example,


#DEF,M,STRING;


will define a macro M whose prototype is 'STRING'. When M is
called as in:


#M;


the value returned is 'STRING'. Hence:


GPM('#DEF,M,STRING;x#M;y')


will return 'xSTRINGy'.


In some respects, the DEF function may be thought of as
assigning a string to a name. But a macro may also have
arguments which may be embedded within the prototype. The
position of the first, second, third, etc. argument is
indicated by the position of the symbols &1, &2, &3, etc.
Thus:


#DEF,SQUARE,&1*&1;


defines the macro SQUARE with one argument. The macro call:


#SQUARE,(X+Y);


returns '(X+Y)*(X+Y)'. Within the argument list of a macro
call there may be other macro calls and these are evaluated to
obtain the actual arguments. For example,


#SQUARE,#M;Y;


returns 'STRINGY*STRINGY'. The macro call may be suppressed by
surrounding a string with pointed brackets. Thus
GPM('AA<#>AA') returns 'AA#AA'. Pointed brackets are stripped
off in pairs. Thus, GPM('A<B<C>D>E') returns 'AB<C>DE'.
Pointed brackets may be used to defer evaluation of macro
calls until some later time. Thus


#DEF,A,<#M;>;


will associate with A the prototype '#M;'. When A is called as
in #A; the returned string is evaluated leading to a call on
#M; which returns 'STRING'.


Were the returned values merely substituted for the macro call
without again being evaluated, the macro system we have
described so far would only be useful as a system of forming
abbreviations. But by the simple act of reevaluating the
returned value, we obtain a general purpose computational
language, a language capable of expressing anything
computable. This is a remarkable fact. To see that this is so,
consider defining a conditional macro #COND,X,Y,Z; which
evaluates to Z if X equals Y and evaluates to null otherwise.
On the one hand, if the returned string were not reevaluated
it would be impossible to write COND (should it be written as
the null string or as &3 ?) and hence GPM would not be
completely general. On the other hand, a conditional allows
one to simulate a Turing machine and hence perform arbitrary
computations. To see this reflect that a state-transition
table (as in a Turing machine) may be implemented as a
collection of conditionals (one for every combination of
states and inputs).


We may write #COND,X,Y,Z; as:


#DEF,COND,<#DEF,<&1>,;#DEF,<&2>,<&3>;#<&1>;>;


In the above, the first argument is defined as a macro which
evaluates to null. The second argument is also defined as a
macro and this definition overrides the first if and only if
the first two arguments are equal (a macro name need not be an
identifier but may be any string of symbols). Finally, the
macro named by the first argument is called. The returned
value is the third argument if the second definition overrode
the first. Programming in this language is opaque but is
perfectly general. If the argument to GPM is not well-formed,
meaning that a '#' is not followed by a corresponding ';' or
that a '<' is not followed by a corresponding '>', GPM will
fail. This fact can be used to apply GPM to a program without
reading it into main storage in its entirety. Only a
sufficient amount of it need be read to enable GPM to succeed.
Said another way, if GPM(S1) succeeds then GPM(S1) GPM(S2)
equals GPM(S1 S2).


There is one point in which the implementation given departs
from official GPM as defined by Strachey. Macro definitions
here are global and not local to the evaluation of a specific
macro. Assume the following definition occurs.


#DEF,X,Initialization <#DEF,X,Action;#X;>;


In our system, #X; will evaluate to 'Initialization Action'
on the first call and to 'Action' on all subsequent calls.
This is because the macro X redefines itself. In Strachey's
system the macro definitions are pushed so that when return is
made to the outer level the original definitions remain
intact. Hence a macro could not redefine itself. There are
advantages and disadvantages to both. As a computation tool,
Strachey's system is perhaps superior since macro names can
serve as temporary variables. For a practical macro processor,
however, it is better to have global macro names.
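
The evaluation rules described above can be sketched in Python
as follows. This is an illustration of the rules only and not a
transcription of the program that follows: the names gpm,
split_call and match_angle are assumptions, the macro table is
an ordinary dictionary passed in by the caller, and definitions
are global as in our system.

```python
def match_angle(s, i):
    """Return the index of the '>' that closes the '<' at s[i]."""
    depth = 0
    for j in range(i, len(s)):
        if s[j] == "<":
            depth += 1
        elif s[j] == ">":
            depth -= 1
            if depth == 0:
                return j
    raise ValueError("'<' without a matching '>'")


def split_call(s, i):
    """Split the call whose name starts at s[i] into raw name and arguments.

    Returns (parts, index just past the closing ';')."""
    parts, start, depth = [], i, 0
    while i < len(s):
        c = s[i]
        if c == "<":
            i = match_angle(s, i)          # skip a quoted stretch whole
        elif c == "#":
            depth += 1                     # a nested call opens
        elif c == ";" and depth > 0:
            depth -= 1                     # a nested call closes
        elif c in ",;" and depth == 0:
            parts.append(s[start:i])
            if c == ";":
                return parts, i + 1
            start = i + 1
        i += 1
    raise ValueError("'#' without a matching ';'")


def gpm(s, macros):
    """Evaluate s, defining and calling macros held in the dict macros."""
    out, i = [], 0
    while i < len(s):
        c = s[i]
        if c == "<":                       # strip one pair of brackets
            j = match_angle(s, i)
            out.append(s[i + 1:j])
            i = j + 1
        elif c == "#":
            parts, i = split_call(s, i + 1)
            name, *args = [gpm(p, macros) for p in parts]
            if name == "DEF":
                macros[args[0]] = args[1]  # DEF returns the null string
            else:
                body = macros.get(name, "")
                values = [name] + args     # &0 is the name, &1.. the arguments
                for k in range(len(values) - 1, -1, -1):
                    body = body.replace(f"&{k}", values[k])
                out.append(gpm(body, macros))   # the result is reevaluated
        else:
            out.append(c)
            i += 1
    return "".join(out)


table = {}
print(gpm("#DEF,M,STRING;x#M;y", table))     # xSTRINGy
print(gpm("A<B<C>D>E", table))               # AB<C>DE
```

Ill-formed input raises ValueError in this sketch, which stands
in for the failure of GPM described above.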


         DEFINE('GPM(S)PREFIX,BOD,ARG,NAME,N,PUSH_POP')

+------------------------------------------------------------+
| Initialization section for GPM: FORB_CH (forbidden         |
| character) is assigned a character not permitted in the    |
| source string. GPM_BAL is assigned a pattern which will    |
| match a string balanced in the GPM sense. Note that        |
| although <> and #; both serve as a kind of parenthesis     |
| they are not symmetric.                                    |
+------------------------------------------------------------+


         &ALPHABET  LEN(1) . FORB_CH
         MAC_TBL  =  TABLE()
         ITEM  =  '<' BAL('<>') '>' | '#' *GPM_BAL ';'
+           | NOTANY('<#') BREAK('<#>;,')
         GPM_BAL  =  ARBNO(ITEM)


+------------------------------------------------------------+
| This is the basic pattern used to process strings. PREFIX  |
| is the string up to a macro call or a <...> literal. BOD   |
| will be either the literal body or the result of           |
| evaluating the macro.                                      |
+------------------------------------------------------------+

         GET.PREFIX.BOD  =  POS(0) BREAK('<#') . PREFIX FENCE
+           ('<' BAL('<>') . BOD '>' |
+           '#' GPM_BAL . NAME . *PROC('NAME')
+           ARBNO(',' GPM_BAL . ARG . *PROC('ARG'))
+           ';' . *PROC('MEND'))
+           | REM . PREFIX NULL . BOD               :(GPM_END)


+------------------------------------------------------------+
| Entry point:                                               |
+------------------------------------------------------------+


GPM      IDENT(S)                                   :S(RETURN)
         S  GET.PREFIX.BOD  =                       :F(FRETURN)
         GPM  =  GPM PREFIX BOD                     :(GPM)
GPM_END


+------------------------------------------------------------+
| The routine PROC will process macro names (at PNAME),      |
| macro arguments (at PARG), and macro terminations (at      |
| PMEND).                                                    |
+------------------------------------------------------------+


         DEFINE('PROC(TYPE)')                       :(PROC_END)
PROC     PROC  =  .DUMMY                            :($('P' TYPE))
PNAME    NAME  =  GPM(NAME)
         N  =  0
         PUSH_POP  =
         PUSH(NAME)                                 :(NRETURN)
PARG     PUSH(GPM(ARG))
         N  =  N + 1                                :(NRETURN)
PMEND    BOD  =  IDENT(NAME,'DEF') POP()            :F(PMEND_2)
         MAC_TBL<POP()>  =  BOD
         BOD  =                                     :(NRETURN)
PMEND_2  BOD  =  REPLACE(MAC_TBL<NAME>,'&',FORB_CH)
PMEND_1  BOD  FORB_CH N  =  TOP()                   :S(PMEND_1)
         N  =  N - 1
         POP()                                      :S(PMEND_1)
         BOD  =  GPM(BOD)                           :(NRETURN)
PROC_END


Names referenced   Name    Type       Where defined

by GPM:            BAL *   Function   Program 8.3
                   PUSH    Function   Program 5.5
                   POP     Function   Program 5.6


* indicates name is referenced in the initialization section.

Exercise 18.1  Suggest a method (or methods) whereby the OPS
and SYMS tables of ASM (Prog. 18.1) can be made smaller at the
expense of time. Implement one of your plans.


Exercise 18.2  Add expressions to ASM (binary +, -, * and /
and unary -) by modifying the semantic routines of L_ONE for
the purpose. Let the period (.) mean the current address.


Exercise 18.3  Assuming there are eight bits per character,
how would you modify ASM to output (on the PUNCH file) a
32-bit word as four characters?


Exercise 18.4  Modify ASM to allow symbols of the form
=<constant>. For example, =37 implies the address of the
constant 37. (This convention was actually assumed by TUPLE,
Prog. 18.7.) Be sure to avoid generating duplicate constants.
All such literals should be placed after the last instruction
of the program being assembled.


Exercise 18.5  What character is not permitted in the argument
to S(name), the semantic subfunction of L_ONE, Prog. 18.2? How
can S(name) be modified to avoid this restriction?


Exercise 18.6  Augment Language L1 (Figure 18.2) by allowing
subscripted expressions. Modify L_ONE accordingly.

| Exercise 18.7 | Identifiers seen by L_ONE are passed on to 
LS the assembler untouched. This is not always 
desirable. Modify L_ONE so that each identifier is replaced 


by a unique ‘internal’ name. 


Ge ee ee ETAN . 
| Exercise 18.8 | Extend L ONE to handle real arithmetic. An 


3 identifier is assumed to be integer or real 
(floating point) depending on whether or not it begins with 
one Of the letters 'IJKLMN'. Allow mixed expressions both in 
binary operations and across an assignment. Assume two ad- 
ditional instructions for machine M, viz. CIR which converts 
from integer to real (loading into the target register) and 
CRI which converts from real to integer. 


= rene dq 
| Exercise 18.9 | Write a program which will read in a BNF 
t grammar and produce for each syntactic 
variable <V> a pattern named V that will match it. Assume 
there are no extraneous blanks. (This requires about eight 


instructions.) 


Oo pe a RED 

| Exercise 18.10 | It has been observed that well over half 
AMAS Of all Fortran programs appearing on 
listings dumped into a certain trash can contain no interior 
blanks. Use this observation to improve the speed of blanks. 


p 7 Re | 
| Exercise 18.11 | If BLINT (a pattern in BLANKS, Prog. 18.3) 
AN is simplified to SPAN(NUM * *) then BLANKS 


will operate incorrectly in some cases.  Furnish such a case. 


Exercise 18.12   A squeamish programmer, wishing to avoid
left-recursion, writes, for the definition of E (a pattern in
POL, Prog. 18.4):

     E = TERM ADDOP *E POL(2) | TERM

What error has been introduced? Give an example of a statement
which would yield incorrect results.


Exercise 18.13   Modify POL so that a null statement is allowed.
This would permit, for example, the sequence:

     IF A=1 THEN ELSE X = 2


Exercise 18.14   Modify POL, Prog. 18.4, to allow IF ... THEN ...
ELSE type expressions. An example is:

     A = IF A > 0 THEN 1 ELSE -1

Transform this syntax into Polish using a 3-ary operator
called EIF (Expression IF).


Exercise 18.15   This exercise indicates how error messages may
be incorporated into POL(). Write a function DNF(S1,S2) (Did Not
Follow) which will form the message:

     A valid ... S1 ... was encountered but
     this was not followed by a valid ... S2 ...

This is to be appended onto a global error message string
(MESSAGE) which is printed if the statement cannot be matched.
Using DNF, modify the patterns of POL, Prog. 18.4, to issue
error messages in the following cases: (1) an expression
doesn't follow an '=' in assignment. (2) a Boolean doesn't
follow an IF. (3) a statement doesn't follow a 'THEN'. (4) a
primary doesn't follow a unary minus. (5) an expression
doesn't follow a '('.


Exercise 18.16   This exercise indicates how SNOBOL4 pattern
matching can be used on the intermediate form to achieve a
degree of machine-independent code optimization. Scan a Polish
string (as output by POL, but with a trailing comma) for a
pattern which resulted from an assignment of the form

     <VAR> = <VAR> + <E>

where <VAR> is the same (possibly subscripted) variable.
Transform this into the 2-ary form:

     AUG:2,<VAR>,<E>

Do the same for an assignment in which the <E> is the first
operand.


Exercise 18.17   Write a pattern to match an arbitrary tree with
no upper limit on the number of leaves.


Exercise 18.18   Modify TREE to accept N additional arguments,
NAME1, NAME2, ..., NAMEn which are to be associated with the
various leaves of the tree. Thus

     TREE('*', 2, .NAME1, .NAME2)

will return, in effect,

     '*:2,' ARB_TREE . NAME1 ARB_TREE . NAME2

To do this exercise, you must assume some maximum N (already
assumed anyway in the coding of TREE). For extra credit, make
your program entirely dependent on the parameter MAX_N.


Exercise 18.19   In POL, Prog. 18.4, argument lists were
compiled into a Polish notation having the form:

Use pattern matching to convert this into the form:

     COMMA:n,arg1,arg2,...


Exercise 18.20   Modify TR, Prog. 18.6, to handle mixed
expressions, both in the binary arithmetic operations and
relations and across assignments. Assume tuples

     CVTIR,Arg1,,Arg3
     CVTRI,Arg1,,Arg3

exist to convert from integer to real and real to integer
respectively.


Exercise 18.21   The following exercise extends TR (Prog. 18.6)
to include functions. Assume that the tuples required for output
for the function reference:

     FUNC(Arg1, Arg2, ..., Argn)

are

     ARG,Arg1
     ARG,Arg2
       ...
     ARG,Argn
     CALL,FUNC,,RES

where RES is the location in which the result is deposited.
Assume that the function ATEST(ID) exists which is a predicate
to determine whether ID is an array. If ID is not an array,
it must be a function.


Exercise 18.22   Modify L_ONE to call TUPLE rather than
producing unoptimized code.


Exercise 18.23   TUPLE (Prog. 18.7) is stupid in not optimizing
the case where the 2nd argument is already in a register and the
first argument is not and the operation is (F)ADD or (F)MUL.
Modify TUPLE to handle this.
Exercise 18.24   The action taken by TUPLE for a label is rather
ruthless (removing all previous register associations). For
labels generated as a result of IF processing, only those
symbols need be disassociated that are actually modified by one
of the clauses. Write a routine that will scan the output of TR
to determine which symbols are modified and arrange to have only
these disassociated when IF-type labels are encountered.


Exercise 18.25   The following formula from Strachey [1965]
defines a macro S with one argument.

     #DEF,S,<#1,2,3,4,5,6,7,8,9,10,#DEF,1,<&>&1;;>;

What is the result of (a) #S,2; (b) #S,5; (c) In words, what
does S do?


Exercise 18.26   Modify ASM so that it uses GPM as a macro
processor. Allow macro prototypes to contain more than one line.
This can be done by encoding line boundaries as a special
character sequence.


Exercise 18.27   It is sometimes required to build up a large
string at assembly time. Write a macro #CS,S; (Concatenate
String) such that when #S; is called all the strings so far
passed to CS will be returned concatenated together.


===================== Solutions =====================
============= FOR ODD-NUMBERED EXERCISES =============

===================== Solutions =====================
===================== for =====================
===================== Chapter 2 =====================
2.1 The body of the function UP(ARG) is
UP        UP  =  REPLACE(ARG,LOWERS,UPPERS)       :(RETURN)
2.3 
L P (POS(0) (SPAN(' ') | '!) | '. ty. 
+ ANY (UPPERS) . C = T UPLO(C) :S (L) 
P = UPLO(P) 
2.5
          SIZE(BASEB(K,2))
          SIZE(BASEB(K,n))
2.7
          DEFINE('V(ARG)B,S,E,F')                 :(V_END)
V         B  =  BASEB(BASE10(ARG,16),2)
          B  LEN(1) . S  LEN(10) . E  REM . F
          V  =  (-1) ** S * CONVERT(BASE10(F,2),'REAL') *
+               2 ** (BASE10(E,2) - 1045)
                                                  :(RETURN)
V_END


2.9 Those involving built-in numerical operators: EQ, REMDR, 
/, * and * (four statements in all). 


2.11 Initialize H with '01234567'; then replace all 16's by 
8's and replace all HEX's by OCT's. 
2.13 After doing the obvious checks on the month being in the
range 1-12 and the day being in the range 1-31, see if the day 
is either the 29th, 30th or 31st. If so, and the DAY (of the 
week) is equal to the DAY of the first, second or third of the 
following month, the day is invalid. 


2.15 M = CEIL((5 * D - 150) / 153.) (See the chapter on
arithmetic for an analysis of this); then take the number of
days and subtract off 31+28 (or 31+29 in a leap year); if this
number is negative, add the number of days in the year (365 or
366). Use the formula above to determine M. Then
REMDR(M + 2, 12) + 1 is the month.


2.17 Insert a test and branch at the entry point of SPELL and
insert a section of code labeled SPELL_LONG as follows:

SPELL       LE(SIZE(N),6)                         :F(SPELL_LONG)
SPELL_LONG  N  RTAB(6) . M  =
            SPELL  =  SPELL(M)
            SPELL  'SEPT'  =  'OCT'
            SPELL  'SEXT'  =  'SEPT'
            SPELL  'QUINT'  =  'SEXT'
            SPELL  'QUADR'  =  'QUINT'
            SPELL  'TRILLION'  =  'QUADRILLION'
            SPELL  'BILLION'  =  'TRILLION'
            SPELL  =  SPELL ' BILLION'
            SPELL  =  NE(N,0) SPELL ' ' SPELL(N)  :(RETURN)


===================== Solutions =====================
===================== for =====================
===================== Chapter 3 =====================
3.1 RPAD(S,N,C) = REVERSE(LPAD(REVERSE(S),N,C))


3.3 RPAD(LPAD(S,(N - SIZE(S)) / 2,C),N,C) 


3.5 (a) REPLACE('CXCB','BBCD',S) ; (b) 4
3.7 (a)
          DEFINE('TPOS(S,H,W)K,C')                :(TPOS_END)
TPOS      S  POS(K)  LEN(1) . C                   :F(TPOS_1)
          TPOS  =  TPOS C
          K  =  K + W                             :(TPOS)
TPOS_1    GE(SIZE(TPOS),H * W)                    :S(RETURN)
          K  =  REMDR(K,W) + 1                    :(TPOS)
TPOS_END
(b)
          &ALPHABET  LEN(H * W) . S1
          S2  =  TPOS(S1)
(c)
          DEFINE('ENCODE(S)T')
          &ALPHABET  LEN(H * W) . S1
          PS1  =  TPOS(S1,H,W)                    :(ENCODE_END)
ENCODE    S  LEN(H * W) . T  =                    :F(ENCODE_1)
          ENCODE  =  ENCODE REPLACE(PS1,S1,T)     :(ENCODE)
ENCODE_1
          S  =  S DUPL(':',H * W - SIZE(S))
          ENCODE  =  ENCODE REPLACE(PS1,S1,S)
          ENCODE  =  DIFF(ENCODE,':')             :(RETURN)
ENCODE_END


3.9 Do a positional transformation to obtain the odd characters
in the string (H1). Then do a similar transformation to obtain
the even characters (H2). Transliterate H1 so that digit k goes
to the (16 * k)th character of &ALPHABET. Transliterate H2 so
that digit k goes to the kth character. Then OR the resulting
strings.
3.11 2*00112233445566778899' 
3.13 IDENT(SKIM(S) ,S) 

3.15 (a)
     REVERSE(REPLACE(TRIM(REPLACE(REVERSE(S),'0',' ')),' ','0'))
(b) +S


3.17 SWAP, SWAP_ARG1 and SWAP_ARG2 
3.19 a-ht, b-ht, d-h 


.21 (X Y) X.Y Yu 


===================== Solutions =====================
===================== for =====================
===================== Chapter 4 =====================
4.1 M = CRACK('JAN.,FEB.,MARCH,APRIL,...', ',')

4.3 (a) opposite pairs are swapped twice resulting in a 

mutual cancellation. A remains unchanged, I is set to N + 1. 
(b) SEQ(' J= N+ 1- I ; (GT(J,I) SWAP(.A<I>,.A<J>))',.T1) 

4.5 SEQ(" A<I> POS(0) NOTANY ('M!) Ma d) 

4.7 It is equivalent to AOPA(A1,' ', A2) 

4.9 STRINGOUT(AOPA(CRACK(X),' ',CRACK(Y)))

4.11 A<FIND (A, '7LGT'*) > 


4.13 A practical version of the following function would use 
'funny' names for temporaries and parameters. 
          DEFINE('DO(S,N,L,U,I)')                 :(DO_END)
DO        S  =  CODE(S ' ; :(DO_1)')              :F(FRETURN)
          $N  =  L                                :<S>
DO_1      $N  =  $N + I
          LE($N,U)                                :S<S>F(RETURN)
DO_END
4.15
          DEFINE('PUSH(A,E)')                     :(PUSH_END)
PUSH      PUSH  =  A
          A<1>  =  A<1> + 1
PUSH_1    A<A<1>>  =  E                           :S(RETURN)
          A  =  CATA(A,A)
          PUSH  =  A                              :(PUSH_1)
PUSH_END
5.1
          DEFINE('CRACK(S,B)N,V,PAT')             :(CRACK_END)
CRACK     IDENT(B,NULL)                           :S(CRACK_1)
          S  RTAB(1) B ABORT | REM . S  =  S B
          PAT  =  BREAK(B) . V  LEN(1)
CRACK_2   S  PAT  =                               :F(RETURN)
          $N  =  LINK(,V)
          N  =  .NEXT($N)                         :(CRACK_2)
CRACK_1   PAT  =  LEN(1) . V                      :(CRACK_2)
CRACK_END
5.3 (a)
          IDENT(PUSH_POP)                         :S(FRETURN)
          NM  =  .PUSH_POP
FIRST_1   NM  =  DIFFER(NEXT($NM)) .NEXT($NM)     :S(FIRST_1)
          FIRST  =  VALUE($NM)
          $NM  =                                  :(RETURN)


(b) Use a doubly-linked list as in Ex. 5.2. 


5.5 No modification to REVL is required. 


5.7
          DEFINE('IFFLD(N,S)I,F')                 :(IFFLD_END)
IFFLD     F  =  FIELD(DATATYPE(S),I + 1)          :F(FRETURN)
          I  =  DIFFER(F,N) I + 1                 :S(IFFLD)F(RETURN)
IFFLD_END


5.9 (1) Insert the four characters ',NEW' behind 'MARK' in
the DATA function. (2) Use the constant 2 rather than 1 in
FIELD. (3) The third statement after VISIT_1 should read:

          FLD(SON,I)  =  GT(...) NEW(GS)          :S(VISIT_1)

(4) Change VISIT_2 to:

VISIT_2   NEW(SON)  =  COPY(SON)  ;  SON  =  NEW(SON)

(5) Return the copied configuration by modifying VISIT_3 to:

VISIT_3   VISIT  =  IDENT(FATHER) SON             :S(RETURN)
===================== Solutions =====================
===================== for =====================
===================== Chapter 6 =====================
6.1 a-F, b-T, c-F, d-F, e-T, f-T, g-F, h-T, i-T, j-T.


6.3 The canonical form is 'BED' | 'BEDS' | 'BEAD' | 'BEADS' |
'RED' | 'REDS' | 'READ' | 'READS'. The pattern is not monic.


6.5  a-Y, b-N, c-Y, d-Y, e-Y, f-Y, g-Y, h-Y, i-N, j-N. 
6.7 NULL | NULL | NULL | NULL | NULL | ... 
6.9 (L ** 2 + 3 * L + 2) / 2


6.11 2 ** L 
6.13 a-Y, b-N, c-N, d-N, e-Y, f-N, g-Y.
6.15 a) [0, 2] b) [0, 2, 4, 4] c) 2 ** K


6.17 ARBNO('AA! B 'A') will match all even-length sequences 


6.19 P1 = FENCE 'ABC', P2 = FENCE 'XYZ'.


6.21
a) RPOS(0) | BREAK(S) SUCCEED
b) ANY(S)
c) ANY(S) | BREAK(S) ANY(S) SUCCEED
d) POS(N) SUCCEED | TAB(N)
e) P = TAB(N) | RTAB(N) TAB(N) SUCCEED | RTAB(N) X
===================== Solutions =====================
===================== for =====================
===================== Chapter 7 =====================
7.1
BREAKP    C  =  CURSOR
BREAKP_1  SUBJECT  POS(CURSOR)  ANY(ARG(NODE))    :S(S)
          CURSOR  =  GE(CURSOR,LENGTH) C          :S(F)
          CURSOR  =  CURSOR + 1                   :(BREAKP_1)

Full credit if LF is used instead of F; half credit if MF is
used. If the pattern match and test are inverted, take 3/4
credit.

7.3 (a) 2 ** N  (b) (4 ** N + 2) / 3


7.5 To form a loop of alternates by alternation or a loop of 
subsequents by concatenation would require that the loop go 
through the root of the second argument since this is the only 
kind of arrow added by these operations. But since the second 
argument does not impinge on the first, no loop can be formed. 
If a loop was formed via ARBNO(P) it must go through P. But 
it could not be a loop of alternates since only solid arrows 
are added out of P and it could not be a loop of subsequents 
because only a dotted arrow enters P. 


7.7 a-9, b-20, c-40, d-14, e- 1, f-7 
7.9 a-Yes, b-Yes, c-No, d-Yes 


7.11 Design TAB(N) as a compound consisting of a node TAB1 
and an alternate TAB2. TAB1 pushes the futility flag, TAB2 
restores it and fails. 


7.13
ARBN1     PUSH(FUTILITY)
          FUTILITY  =  1                          :(S)
ARBN2     FUTILITY  =  EQ(FUTILITY,1) EQ(&FULLSCAN,0) POP()
+                                                 :S(LF)
          POP()                                   :(S)


7.15 Create a compound similar to Figure 7.8 with NOT1, NOT1B
and NOT2 in place of VA1, VAB1 and VA2 and with no VAB2. NOT1,
like VA1, pushes a nonnegative value onto Stack Alpha. NOT2
changes this to a negative value and fails. NOT1B (NOT1 on
Backup) pops the value and succeeds or fails depending on
whether the value is positive or negative.


7.17 Call the root node r. Then 
D(r) = D(s) | LEN(1) D(r) | D(a) 
Since D(r) is supposed to equal ARB D(s) | D(a) we may plug 
this trial value into the right hand side and after some 
manipulation we obtain 
ARB D(s) | LEN(1) D(a) | D(a) 
which does not equal the trial value. 


7.19
SCAN      IDENT(ALT(NODE))                        :S($PROG(NODE))
          PUSH(NODE)  ;  PUSH(CURSOR)
          NODE  =  ALT(NODE)                      :(SCAN)
S         NODE  =  SUBS(NODE)
          IDENT(NODE)                             :S(RETURN)F(SCAN)
F         CURSOR  =  POP()  ;  NODE  =  POP()
          IDENT(NODE)                             :S(FRETURN)F($PROG(NODE))
===================== Solutions =====================
===================== for =====================
===================== Chapter 8 =====================
8.1 ARBNO(NOTANY(S)) RPOS(0) | BREAK(S)
8.3 Replace calls to BREAK by calls to BREAKREM. 
8.5 3,4,5,6 
8.7 When NAME is converted to an expression the result is not
EVAL'ed as an identifier but as a concatenation.
8.9 NULL
8.11 IF(P) = NOT(NOT(P)) 
8.13 In the fourth line following LIKE_1 add a third
alternative to produce:
          LIKE  =  LIKE | T1 T2 | T1 LEN(1) T2
8.15 either parenthesis 
8.17
          QLIT  =  Q BREAK(Q) Q
          CMNT  =  '/*' ARBNO(NOT('*/') LEN(1)) '*/'
          ELEM  =  QLIT | CMNT | NOT(Q | '/*') LEN(1) BREAK('/;' Q)
          PLI.STMT  =  POS(0) (ARBNO(ELEM) ';') . STMT
8.19
          DEFINE('NAME(NO)D,X')                   :(NAME_END)
NAME      NO  LEN(1) . D  =                       :F(RETURN)
          '2ABC3DEF4GHI5JKL6MNO7PRS8TUV9WXY0ZZZ1***'  D  LEN(3) . X
          NAME  =  NAME ANY(X)                    :(NAME)
NAME_END
9.1
          DEFINE('READ(P)')                       :(READ_END)
READ      LT(NF_INPUT,0)                          :S(FRETURN)
          READ  =  POP()                          :S(READ_1)
          READ  =  INPUT                          :F(READ_2)
READ_1    READ  P                                 :S(RETURN)
          PUSH(READ)                              :(FRETURN)
READ_2    NF_INPUT  =  NF_INPUT - 1               :(READ)
READ_END
9.3 The following will remove blanks except within string
literals as defined in the exercise. To handle 'real' Fortran
we must be a bit more sophisticated. See BLANKS, Prog. 18.3.

Before returning, execute the following code. The patterns
can (and perhaps should be) defined out of line.

            Q  =  "'"  ;  QQ  =  '"'
            QLIT  =  Q BREAK(Q) Q | QQ BREAK(QQ) QQ
            HOL  =  SPAN('0123456789') $ N 'H' LEN(*N)
            PAT  =  POS(0) ARB . T1 NULL . T2
+                   (SPAN(' ') | (QLIT | HOL) . T2)
            FORTREAD  LEN(6) . T  =
FORTREAD_2  FORTREAD  PAT  =                      :F(FORTREAD_3)
            T  =  T T1 T2                         :(FORTREAD_2)
FORTREAD_3  FORTREAD  =  T FORTREAD               :(RETURN)

The above will not handle the rare case that the integer
preceding the H in a Hollerith literal contains interspersed
blanks. This can be handled as follows (take extra credit if
you did this):

            HOL  =  SPAN('0123456789 ') $ N 'H' LEN(*DIFF(N,' '))


9.5 The following rendition of ASMREAD assumes that the READ
routine removes comments.

          DEFINE('ASMREAD()A,T')
          CONTINUE  =  TAB(71) . T  NOTANY(' ')
          CONTINUE16  =  DUPL(' ',16) CONTINUE
          ORDINARY  =  TAB(71) . T
          ORDINARY16  =  DUPL(' ',16) ORDINARY    :(ASMREAD_END)
ASMREAD   A  =  READ(CONTINUE) T                  :S(ASM_1)
          ASMREAD  =  READ(ORDINARY) T            :S(RETURN)F(FRETURN)
ASM_1     A  =  READ(CONTINUE16) A T              :S(ASM_1)
          A  =  READ(ORDINARY16) A T              :F(RETURN)
          ASMREAD  =  A                           :(RETURN)
ASMREAD_END
9.7 (a)   S  POS(C - 1)  LEN(L) . A  =  LPAD(TRIM(A),L)

(b) To convert X's in S to number pairs write:

LOOP      S  BREAK('X')  @K  SPAN('X') . X  @L  = :F(DONE)
          PAIRS  =  PAIRS '(' N + K ',' SIZE(X) ')'
          N  =  N + L                             :(LOOP)
DONE

The rest is straightforward. 

9.9 (a) 
PEEL.K2. = POS(0) TAB(*K1.) (ANY(AFTER) @K2. | 

+ LEN (1) FASTBAL(,'"' "en, BEFORE AFTER) 

+ (8K2. ANY(BEFORE) | ANY (AFTER) @K2.) 

+ l REM @K2.) 

(b) Make AFTER, BEFORE and C temporaries to PEEL. Define
PEEL.K2. with unevaluated expressions *AFTER and *BEFORE in
place of AFTER and BEFORE respectively. Replace the branch to
PEEL_1 in the first statement of PEEL by a branch to PEEL_3;
also replace the branch to ERROR by a branch to PEEL_3. PEEL_3
is defined as:

PEEL_3    K1.  =  0


t: ,)>" BEFORE LEN(1) . C : F (ERROR) 

BEFORE = BEFORE C 

'= ,(<' AFTER LEN(1) . C 

AFTER = AFTER C : (PEEL 1) 
9.11
          NONID  =  NOTANY('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_.')
L1        X  =  SNOREAD()                         :F(END)
L2        X  (NONID ARBNO(' ')) . N  'ALPHA('  =
+            N 'ALPHANUMERIC('                    :S(L2)
          SNOPUT(X)                               :(L1)
END
===================== Solutions =====================
===================== for =====================
===================== Chapter 10 =====================
10.1 In the line after BNORM_1 change the go-to field to
:F(FRETURN)S(RETURN) and in the line labeled BNORM_UNB change
the go-to field to :(FRETURN).


10.3 If there is an inversion then the spacing between the 
two characters must be < -2. But no string can have a spacing 
this negative unless it contained a double BSPACE. 


10.5 

NB = NOTANY(BSPACE) 

INORM(S,) (POS(0) | NB) INORM (Sə) (NB | RPOS(0)) 
10.7 

PR POS =  POS(0) AN BREAK(BSPACE) @N FAIL | 
* POS(0) *NE(N,0) TAB(*(N- 1)) . S1 
" (LEN(1) ARBNO(BSPACE LEN(1)) . C1 
+ (NOTANY (BSPACE) | RPOS(0)) . C 
10.9 (a) Change the line
          UF1  =  LT(UF1,0) -UF1
to
          UF1  =  LT(UF1,0) (-2 * UF1)

(b) Modify
          UF1  =  CW - W
          UF1  =  LT(UF1,0) -UF1
          UF1  =  UF1 + SIZE(HYPHEN)
to
          UF1  =  UF_P * (CW - W)
          UF1  =  LT(UF1,0) -(UF_C * (UF1 / UF_P))
          UF1  =  UF1 + UF_H * SIZE(HYPHEN)
(b)

   k   |  value   HYPHEN   |  value   HYPHEN
  -----+-------------------+------------------
   2   |    4       -      |    9      null
   4   |    8       -      |    9      null
   6   |    8       -      |    9      null
   8   |  fails   not set  |    9      null
10.13 
Replace DIGRAMS = 'XA,-(0)B, +... 
by DIGRAMS = "'XE,~(@FHSY)T ... 
Replace DIGRAM TBL =  TABLE(30) 
by DIGRAM PAT = ABORT 
Replace DIGRAM_TBL<C> = ANY(CC) 
by DIGRAM PAT = C FENCE ANY (CC) | DIGRAM_PAT 
In the pattern HYPH_PAT: 
Replace FENCE ARB LEN(1) $C... 
by @K ABORT 
Replace RWORD  HYPH PAT :F (FRETURN) 
by 
RWORD HYPH PAT :S(HYPH 3) 
HYPH 2 K= K+ 1 LT(K, SIZE(RWORD) - 1) :F (FRETURN) 
RWORD TAB(K - 1) DIGRAM PAT :F(HYPH 2) 
HYPH 3 
10.15 (a)
          DEFINE('PRIMAGE(S)I')
          OUTPUT(.OVER,+...)                      :(PRIMAGE_END)
PRIMAGE   OUTPUT  =  IMAGE(S,1)
          OVER  =  IMAGE(S,0)
PRIMAGE_1 I  =  I + 1
          OVER  =  IMAGE(S,I + 1)                 :S(PRIMAGE_1)F(RETURN)
PRIMAGE_END
(b)       S1  =  BNORM(S1)  ;  S2  =  BNORM(S2)
          PRIMAGE(DUPL(' ',9) S1 DUPL(' ',50 - SPACING(S1)) S2)
10.17
          P  =  BNORM(P)
          LINE_INIT(P)
LOOP      LENGTHS  BREAK(',') . CW  ','  =        :F(DONE)
          PRIMAGE(DUPL(' ',(60 - CW) / 2) LINE(CW))  :(LOOP)
10.19 
L S teet ('(' BAL . K !)' | LEN(1) . K) 
+ = DUPL(' *, SIZE(K)) DUPL(BSPACE, SIZE(K)) K 
+ 2S (L) 


S = BNORM(S) 
OUTPUT = IMAGE(S,2) 

OUTPUT = IMAGE(S, 1) 
===================== Solutions =====================
===================== for =====================
===================== Chapter 11 =====================

11.1 a-No, b-No, c-Yes, d-Yes, e-Yes, f-Yes, g-No
11.3 a-t; b-3, c-3, d-2, e-0, f-3, g-4, h-2, i-0
11.5 ' I = 0'


11.7 Recursive: F(1) = .164, F(n) = .140n + .006 
Iterative: F(1) = .126, F(n) = .096n + .030 


11.9
          OPSYN('CODE.','CODE')
          DEFINE('CODE(S)')                       :(CODE_END)
CODE                :<CODE.(' CODENO = &STNO + 1 :(CODE_1)')>
CODE_1    CODE  =  CODE.(S)                       :(RETURN)
CODE_END


11.11 Write a routine CAPTURE(T1,S1) which is called by
TPROFILE upon entry as CAPTURE(TIME(),&LASTNO)


===================== Solutions =====================
===================== for =====================
===================== Chapter 12 =====================
12.1 (a-e) 38,11,86,-,24 
12.3
          RADIX  =  0
          I  =  0
          FACTOR  =  1
LOOP      V  BREAK(',') . V1  LEN(1)  =           :F(DONE)
          RADIX  =  RADIX + 1
          FACTOR  =  FACTOR * RADIX
          I  =  V1 * FACTOR + I                   :(LOOP)
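
The loop in 12.3 can be mirrored outside SNOBOL4 to check small cases by hand. The following is a minimal sketch in Python, not the book's code; the function name `factorial_digits_to_int` is hypothetical, and it assumes, as the loop does, that V is a comma-terminated list of digits whose k-th entry carries weight k!.

```python
def factorial_digits_to_int(v: str) -> int:
    """Mirror of the 12.3 loop: v is a comma-terminated digit list, e.g. '1,0,2,'."""
    radix = 0
    factor = 1
    total = 0
    # Like BREAK(',') in the loop, anything after the last comma is ignored.
    for digit in v.split(",")[:-1]:
        radix += 1
        factor *= radix          # factor now equals radix!
        total += int(digit) * factor
    return total


if __name__ == "__main__":
    print(factorial_digits_to_int("1,0,2,"))   # 1*1! + 0*2! + 2*3!
```

Feeding the largest digit string for three positions, '1,2,3,', gives 23, one less than 4!, which is the collapse that solution 12.5 describes.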


12.5 Add 1 to the number associated with the record 1, 2, 3,
..., n-1 to obtain

     1 + 1! + 2*2! + 3*3! + ... + (n-1)*(n-1)!

Note that k! + k*k! = (k+1)! so that the first two terms
keep collapsing until only one term is left, viz. n!
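
The collapse can be written out step by step. This is a worked restatement of the argument in 12.5 and adds no new assumption:

```latex
\begin{aligned}
k! + k\cdot k! &= (1+k)\,k! = (k+1)! \\[4pt]
1 + \sum_{k=1}^{n-1} k\cdot k!
  &= \bigl(1! + 1\cdot 1!\bigr) + 2\cdot 2! + \cdots + (n-1)(n-1)! \\
  &= \bigl(2! + 2\cdot 2!\bigr) + 3\cdot 3! + \cdots + (n-1)(n-1)! \\
  &\;\;\vdots \\
  &= (n-1)! + (n-1)(n-1)! \;=\; n!
\end{aligned}
```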


12.7 1,0,null string, I 


12.9 PERMUTATION(S, 6 * 5 * 4 * 3 * 2 - 1) 
12.11 (a) The statements which need modification are:
          N  =  REMDR(I,RADIX)
          I  =  I / RADIX
(b) Perform 'short division' on the string. The function
below will divide a string by an integer and return the
quotient. R is a global variable set to equal the remainder.

          DEFINE('DIVIDE(S,I)')                   :(DIVIDE_END)
DIVIDE    R  =
DIVIDE_1  S  LEN(1) . T  =                        :F(RETURN)
          R  =  R T
          DIVIDE  =  DIVIDE (R / I)
          R  =  REMDR(R,I)                        :(DIVIDE_1)
DIVIDE_END

So the two statements may be replaced with:
          I  =  DIVIDE(I,RADIX)
          N  =  R
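
Short division on a digit string is easy to trace by hand. The sketch below restates DIVIDE in Python for that purpose; it is an illustration under stated assumptions rather than the book's routine. The name `divide` is hypothetical, the remainder is returned alongside the quotient instead of being left in a global R, and decimal digits are assumed.

```python
def divide(s: str, i: int):
    """Short division of the decimal digit string s by the positive integer i.

    Returns (quotient_string, remainder). As in the SNOBOL4 version, one
    quotient digit is emitted per input digit, so leading zeros are kept.
    """
    r = 0
    quotient = ""
    for t in s:
        r = r * 10 + int(t)      # SNOBOL4: R = R T (append the next digit)
        quotient += str(r // i)  # SNOBOL4: DIVIDE = DIVIDE (R / I)
        r %= i                   # SNOBOL4: R = REMDR(R, I)
    return quotient, r


if __name__ == "__main__":
    print(divide("1234", 7))
```

Because the remainder carried into each step is less than i, every partial dividend is below 10 * i and each step yields exactly one digit.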


12.13 After PERM_INIT insert the statement:

          (EQ(SIZE_A,1) DEFINE('PERM(A)','PERM_F'))  :S(RETURN)


12.15
Change:   SIZE_A  =  *PROTOTYPE(A)
To:       SIZE_A  =  SIZE(G_S)
Change:   SWAP(.A<AL>,.A<AL + D>)
To:       G_S  POS(AL + D - 1)  LEN(2) . T  =  REVERSE(T)


12.17 (a) 100, (b) 20 


12.19 (1) At the entry point, put in an explicit check for
the null string in order to break recursion. (2) Obtain C
from &ALPHABET as follows:

          REVERSE(&ALPHABET)  ANY(S) . C

(3) Remove the statement at REORDER_1 and shift the label to
the next statement. (4) Remove the second parameter from the
function definition and from the recursive call.


12.21 All reorderings. The function has no memory so that if 
it produced, say, 'ABBC' twice, as it would have to do if it 
produced all permutations of  'ABBC', then it would never 
produce anything else. 


12.23 (a) P, (b) P, (c) I, (d) I.


===================== Solutions =====================
===================== for =====================
===================== Chapter 13 =====================


13.1 The 2 instructions starting with BSORT_2 constitute the
inner loop. An improvement is to add an instruction

          V1  =  A<K>

and use V1 in place of A<K> in two places. This saves one
array reference but adds an assignment statement; it is faster
but just barely.


13.3 Replace the two RETURN's by transfers to HSORT_X. Then
replace the two calls to HSORT by the following instructions:

          PUSH(I)  ;  PUSH(K)
          I  =  K + 1                             :(HSORT)
HSORT_X   N  =  POP()                             :F(RETURN)
          I  =  POP()                             :(HSORT)
13.5
          DEFINE('GRTH(X,Y)')
GRTH      GT(X,Y * R)                             :S(RETURN)F(FRETURN)
GRTH_END
          I  =  MSORT(A,'GRTH')
          A  =  AI(A,I)
13.7 MSORT(A,'LT')

13.9 Add one more alternand:

          SS_PAT  =  ... | RPOS(0) . T

13.11 Add the statement LSON(T) = NULL before LIN_1.

13.13 (a) 2(n+1) (1/2 + 1/3 + ... + 1/(n+1)) - 2n
(b) 2 ln 2 = 1.38


===================== Solutions =====================
===================== for =====================
===================== Chapter 14 =====================


14.1 (a) MAX(X,Y) will fail if X < Y. (b) Append a semicolon 
(;) to the argument. 


14.3 Change the :(RETURN) to :S(RETURN) and add the following
two statements:

          OUTPUT  =  CODE
          CODE(LBL ' :(FRETURN)')                 :(RETURN)
14.5
          <Definition of LOADEX function>
                                                  :(START)
L1        LOADEX('L1')                            :(L1)
L2        LOADEX('L2')                            :(L2)
L100      LOADEX('L100')                          :(L100)
START


14.7 Makes no difference.


14.9 Replace
     PUSH(&ANCHOR) ... &ANCHOR = 0 ... &ANCHOR = POP() ...
by
     PUSH(ARB) ... ARB = SARB ... ARB = POP() ...


14.11 The names used by both packages to name identical
operations must not be the same. Thus
REDEFINE('+','CSUM(X,Y)') would be OK for complex sum, but not
REDEFINE('+','SUM(X,Y)').


14.13
          DEFINE('F.(X)')
          OPSYN('F','F.')                         :(F._END)
F.        F  =  X                                 :(RETURN)
F._END
14.15 
REDEFINE (' ', 'CAT(X,Y) ') 
CAT CAT = XY() X * Y :S (RETURN) 
CAT = CAT. (X,Y) : (RETURN) 
14.17
          OPSYN('OPSYN.','OPSYN')
          DEFINE('OPSYN(NAME1,NAME2)')
          OPSYN('DEFINE.','DEFINE')
          DEFINE.('DEFINE(PROTO,LBL)NM')
          DEFINE('FUNCTION(NAME)')
          FUNC_LIST  =  ',OPSYN.,OPSYN,DEFINE,'
                                                  :(FUNCTION_END)
DEFINE    PROTO  BREAK('(') . NM
          FUNC_LIST  =  FUNC_LIST NM ','
          DEFINE.(PROTO,LBL)                      :(RETURN)
FUNCTION  FUNC_LIST  ',' NAME ','                 :S(RETURN)F(FRETURN)
OPSYN     FUNC_LIST  =  FUNC_LIST NAME1 ','
          OPSYN.(NAME1,NAME2)                     :(RETURN)
FUNCTION_END


===================== Solutions =====================
===================== for =====================
===================== Chapter 15 =====================
15.1
          DEFINE('COMB(N,M)K')                    :(COMB_END)
COMB      COMB  =  1
COMB_1    EQ(K,M)                                 :S(RETURN)
          K  =  K + 1
          COMB  =  COMB * (N - M + K) / K         :(COMB_1)
COMB_END
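
The COMB loop relies on every intermediate value being a whole number, which is why the multiplication must come before the division. A minimal Python sketch of the same loop follows, intended only to make that property easy to test; the name `comb` is illustrative and the code is a restatement, not the book's routine.

```python
def comb(n: int, m: int) -> int:
    """Binomial coefficient C(n, m) built up multiplicatively, as in 15.1."""
    result = 1
    k = 0
    while k != m:
        k += 1
        # After this step result equals C(n - m + k, k), so the integer
        # division is exact as long as the multiplication is done first.
        result = result * (n - m + k) // k
    return result


if __name__ == "__main__":
    print(comb(5, 2), comb(13, 2))
```

Writing the update as `result * ((n - m + k) // k)` instead would truncate the inner quotient and give wrong answers, for example at n = 5, m = 2.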
15.3 COMB(L,N) - 1 
15.5 (a) DIFF DIFF = SUM(X,MINUS(Y)) : (RETURN) (b) 5 


15.7 Before the first of the SPLITS insert 
DIV = LE(SUBSTR(Y,1,1), 5) X * 2 / Y * 2 


15.9 X > Y / (CEIL(Z) + 1) 
15.11 (a) E = e2 / 2(e + 1) (b) 5 


15.13 A = 1, 2, 4, 5 (integers).

15.15 (a)
     ASIN(X) = 2 * ASIN(SQRT((1 - SQRT(1 - X ** 2)) / 2))
(b) the same as the stopping criterion for SIN(A)
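
The identity in part (a) is the half-angle formula applied to the arcsine. A short derivation, assuming 0 <= X <= 1 so that every square root is taken with a nonnegative sign:

```latex
\begin{aligned}
\theta &= \arcsin x, \qquad \cos\theta = \sqrt{1 - x^{2}}, \\[4pt]
\sin\frac{\theta}{2} &= \sqrt{\frac{1 - \cos\theta}{2}}
                      = \sqrt{\frac{1 - \sqrt{1 - x^{2}}}{2}}, \\[4pt]
\arcsin x &= \theta = 2\,\arcsin\!\left(\sqrt{\frac{1 - \sqrt{1 - x^{2}}}{2}}\right).
\end{aligned}
```

Each application replaces the argument by a smaller one, so repeating it drives the argument toward the range where a short series suffices.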


15.17 105 
CONVERT (LOG (X, 2) , 'INTEGER') + 1 


X / (2 ** N) 
CONVERT(X * 2 ** 27, 'INTEGER!) 


Hx zZ 
H Wu MW 


15.21 The difficulty is that NAT_BASE is single precision.
Replace the second occurrence of NAT_BASE by EXP(X / X).


===================== Solutions =====================
===================== for =====================
===================== Chapter 16 =====================


16.1 ID(RANDOM(0))


16.3 Let HA = LEN(5). Then the following statement will
execute the deal.
          RPERMUTE(DECK)  HA . P1  HA . P2  HA . P3  HA . P4

16.5 The last one. Instead of assigning CODE(CODE) to a
table, simply go to it. The first two statements could also
be eliminated.


16.7 In general, any string not containing a balancing right
bracket to a left bracket will cause looping. One example is
'('. The cure is to prefix the pattern LEN(1) to LITERAL.TEXT.


16.9 Let %C be equivalent to C where C is some character.
Thus %| is equivalent to | and %% is equivalent to %.
Implementation is simple:

          LITERAL.TEXT  =  POS(0) '%' LEN(1) . TEXT |
+                          BREAK('<=(%') . TEXT


16.11 The probability P must satisfy the equation: 2P = 1 +
P ** 3. The solutions to this equation are 1, .618, and -1.62.
The value 1 is unsuitable because the situation is clearly
worse than the case where it just barely halts. -1.62 is not
a probability. Hence, by elimination, P = .618
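
The three roots can be found in closed form, since P = 1 is a root by inspection and the cubic then factors:

```latex
\begin{aligned}
2P = 1 + P^{3}
  &\iff P^{3} - 2P + 1 = 0 \\
  &\iff (P - 1)\,(P^{2} + P - 1) = 0, \\[4pt]
P = 1 \quad\text{or}\quad
P &= \frac{-1 \pm \sqrt{5}}{2} \approx 0.618,\; -1.618 .
\end{aligned}
```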


16.13 (a)
LOOP      N  =  N + 1
          NUM  =  LT(RANDOM(),RANDOM() ** 2) NUM + 1.0
          OUTPUT  =  EQ(REMDR(N,100),0) N ': ' (NUM / N)  :(LOOP)
(b) ± .94/SQRT(N)
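
The figure in (b) is consistent with a band of two standard deviations around the estimate, assuming the two RANDOM() calls are independent and uniform on (0,1). Under that assumption, with U and V the two draws and p the probability being estimated:

```latex
\begin{aligned}
p &= \Pr\{\,U < V^{2}\,\} = \int_{0}^{1} v^{2}\,dv = \tfrac{1}{3}, \\[4pt]
\sigma_{N} &= \sqrt{\frac{p\,(1-p)}{N}} = \frac{\sqrt{2}/3}{\sqrt{N}} \approx \frac{0.471}{\sqrt{N}}, \\[4pt]
2\,\sigma_{N} &\approx \frac{0.94}{\sqrt{N}} .
\end{aligned}
```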


16.15 Replace the rule that begins 'OUTS = GT(' by simply the
predicate to obtain the statement:
          GT(K,H(S))                              :S(RS_OUT)
Then at RS_OUT insert:
RS_OUT    ADV  =  LT(RANDOM(),E) '123R'           :S(RS_4)
          OUTS  =  OUTS + 1


16.17 In the program which follows, FORMAT will format a 
string for output; MIRIM will return the mirror image of any 
given sequence of positions and RSTEP will move half the 
dancers one random step forward making sure no conflicts occur 
among the dancers or their mirror images. 


          DEFINE('FORMAT(S)C')                    :(FORMAT_END)
FORMAT    S  LEN(1) . C  =                        :F(RETURN)
          FORMAT  =  FORMAT ' ' C                 :(FORMAT)
FORMAT_END
          DEFINE('MIRIM(POS)')                    :(MIRIM_END)
MIRIM     MIRIM  =  REPLACE(POS,'ABCDEFGHIJKLMXYZ',
+                   'DCBAIHYFEMLKJZGX')           :(RETURN)
MIRIM_END
          DEFINE('RSTEP(CPOS)P,NPS,NP')
          NEXT_POS  =  'A(ABEF)B(ABCF)E(AEFJX)F(ABEFJK)J(EJFKX)'
+                      'K(JFKXGL)X(EJXK)Y(KYL)'
          NEXT_POS  =  NEXT_POS MIRIM(NEXT_POS)   :(RSTEP_END)
RSTEP     CPOS  LEN(1) . P  =                     :F(RETURN)
          NEXT_POS  P '(' ARB . NPS ')'
          NPS  =  RPERMUTE(NPS)
RSTEP_1   NPS  LEN(1) . NP  =                     :F(FRETURN)
          'XZ'  NP                                :S(RSTEP_2)
          (RSTEP MIRIM(RSTEP))  NP                :S(RSTEP_1)
RSTEP_2   RSTEP  =  RSTEP NP                      :(RSTEP)
RSTEP_END
          OUTPUT  =  FORMAT('12345678')
          POS  =  'XXXX'
LOOP      OUTPUT  =  FORMAT(POS MIRIM(POS))
          POS  =  RSTEP(POS)
          N  =  LT(N,100) N + 1                   :S(LOOP)
END
===================== Solutions =====================
===================== for =====================
===================== Chapter 17 =====================


17.1 Assume for the moment that ONEWAY maps integers to in- 
tegers. The machine obtains a random number N1 and prints 
ONEWAY(N1). The player thinks of a number N2 and types it in. 
The machine initializes a random number generator with the sum 
N1 + N2. After the hand is completely over and before the 
start of a new deal, the machine prints out N1 which enables 
the player to check on the machine's honesty. 
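
The protocol in 17.1 is a commitment scheme, and its steps can be sketched as follows. This is an illustration in Python under explicit assumptions, not the book's program: SHA-256 stands in for ONEWAY, and the function names (`oneway`, `machine_commit`, `shuffled_deck`, `player_checks`) and the 52-card integer deck are all hypothetical.

```python
import hashlib
import random
import secrets


def oneway(n: int) -> int:
    """Stand-in for ONEWAY: SHA-256 of the decimal digits, read back as an integer."""
    digest = hashlib.sha256(str(n).encode("ascii")).hexdigest()
    return int(digest, 16)


def machine_commit():
    """The machine picks N1 and publishes only ONEWAY(N1)."""
    n1 = secrets.randbelow(2 ** 64)
    return n1, oneway(n1)


def shuffled_deck(n1: int, n2: int) -> list:
    """Seed a generator with N1 + N2, then shuffle a 52-card deck."""
    deck = list(range(52))
    random.Random(n1 + n2).shuffle(deck)
    return deck


def player_checks(commitment: int, revealed_n1: int) -> bool:
    """After the hand, confirm that the revealed N1 matches the commitment."""
    return oneway(revealed_n1) == commitment


if __name__ == "__main__":
    n1, commitment = machine_commit()      # machine prints the commitment
    n2 = 123456                            # player's number, typed in
    deck = shuffled_deck(n1, n2)           # hand is dealt from this deck
    print(player_checks(commitment, n1))   # machine reveals N1 afterwards
```

Once N1 is revealed the player can also recompute the shuffle from N1 + N2 and compare it with the cards actually dealt, which is what makes the check meaningful.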


17.3 The game is ill-formed. From a decision graph stand- 
point there are an infinitude of nodes and every terminal 
state is avoided by A whose best interests lie in prolonging 
the game until B's wallet is exhausted. 


17.5 Variables which can't be used are those indicated as
temporary. They all begin with 'Q' so that programs using
QUEST should avoid them. As a precaution to their forgetting,
one can insert

          QN  POS(0) 'Q'                          :S(ERROR)

after label QUESTP_1.


17.7 After the check for '...' insert: 


QVP POS(0) LEN(1) . QC1 '-' LEN(1) . QC2 RPOS (0) 
+ :F(QUESTP 4) 
ALPHABET BREAK (QC1) BREAK(QS) :F (FRETURN) 
REVERSE (SALPHAPET) BREAK (QC2) BREAK (QS) :F (FRETURN) 
FQ (SIZE (QS), 1) :S(QUESTP 3)F(FRETURN) 
QUESTP 
17.9 Replace J = 0 by LIST = MAX. Replace: 
J = J + 1 LT(J,MAX) 
by 
LIST BREAK(,) eJ , { (LEN(1) REM) . J = 


As a matter of aesthetics, the name 'MAX' could be changed. 
17.11 For both cases, 8 X 3 X 2 = 48 


17.1 3xXx226 
17.15 Add: EQ(V,1) :S(TTTM_8) immediately before TTTM_3.
17.17 Replace &ALPHABET by ORD_ALPHA which is defined as:

          FULL_DECK  LEN(13) . SA  LEN(13) . SB
+                    LEN(13) . SC  LEN(13) . SD
          ORD_ALPHA  =  BLEND(BLEND(SA,SB),BLEND(SC,SD))
17.19
LOOP      H  =  VALS(RHAND(13,1))
          OUTPUT  =  4 * COUNT(H,'M') + 3 * COUNT(H,'L')
+                    + 2 * COUNT(H,'K') + COUNT(H,'J')   :(LOOP)


17.21 The problem lies with the FLUSH test. It should 
properly go after the test for a full house. Thus 2H 2H 2H 3H 
3H should be interpreted as a full house. The initial pairs 
test was inserted for speed. This could be left out sim- 
plifying the result. 


17.23 Setting VALS = W V and doing a :(PR(2)) is good enough
for a uniform distribution but won't distinguish between hands
that contain the same pairs but differ in only the fifth card.
Hence, replace the W V in the call to PR by the expression:

     BASEB(CONVERT((CONVERT(DECOMB(W V),'REAL') / COMB(13,2))
+          * 13 ** 2,'INTEGER'),13)


17.25 After HE_BETS insert:
          QUEST('How much? /BET(1...BET)')

===================== Solutions =====================
===================== for =====================
===================== Chapter 18 =====================


18.1 One method is to insert integers rather than strings
into the table. Thus, instead of inserting '2F', insert
BASE10('2F',16). Another, perhaps extreme, method is to
combine all elements of a table into a long string and use
pattern matching to extract an element.


18.3 PUNCH = CH(OP AC X A) (Using Prog. 2.7). 


18.5 The single quote (') cannot be used. The solution is to 
use the QUOTE function (Prog. 3.16). 


18.7 Assuming CRNAME() returns a unique created name:
          IDTBL  =  TABLE()
          IDEN  =  ... S('ID')

S_ID      T  =  POP()
          (DIFFER(IDTBL<T>) PUSH(IDTBL<T>))       :S(NRETURN)
          IDTBL<T>  =  CRNAME()
          PUSH(IDTBL<T>)                          :(NRETURN)


18.9 
Q = tee ee 
L2 X = INPUT ¿F<CODE(S ' ; +: (DONE) ')> 
X '<'  BREAK('>') . K '>9! = K 
X lo.=f = e = 6 
L X '€«'  BREAK('>') . K !'»' = Qt *' K ! tQ :S (L) 
X = REPLACE(X,'|','«') Q 
L1 X '<t = q tet Q DAL 
S = S X ts! : (L2) 
DONE 


1 ALPHA( H) would be converted to ALPHA('*'). 

NLSTMT - "e , *PUSH() BL 

STMT = IFSTMT | ASGNSTMT | NLSTMT
18.15 Writing DNF is obvious. We then replace the *E of ASGNSTMT by
        (*E | *DNF('assignment operator (=)', 'EXPRESSION'))

Replace the BOOL of IFSTMT by
        (BOOL | *DNF('IF keyword', 'relation'))

etc.
18.17 ATREE = BREAK(':,') (',' | ':' SPAN('0123456789') $ N 
+ ',' *EVAL(DUPL('*ATREE ',N))) 
18.19

POLISH  'COMMA:' SPAN('0123456789') . N ',' 
E ARB TREE . T 'COMMA:2* = ‘'COMMA:' (N+ 1) T 
18.21 At TR_REF, after extracting the ID, apply the predicate
ATEST(ID). If this fails, branch to TR_FREF defined as
follows.
TR_FREF   POLISH POS(0) 'COMMA:2,' =                :F(TR_FREF1)
          TR = TR TR() 'ARG,' POP() '//'            :(TR_FREF)
TR_FREF1  TR = TR TR() 'ARG,' POP() '//'
          TR = TR 'CALL,' ID ',,' PUSH(TEMP()) '//' :(RETURN)

18.23 
TU_ADD ;TU_MUL ;TU_FADD ;TU_FMUL
        ISREG(ARG1)                              :S(TU_SUB)
        R = ISREG(ARG2)                          :F(TU_SUB)
        OUTPUT = ' ' OP ' ' R ',' ADDR(ARG1)
        DEASSOC(R)
        STORE(R, ARG3)                           :(RETURN)

TU_SUB ;TU_DIV ;TU_FSUB ;TU_FDIV
18.25 (a) 3, (b) 6, (c) Returns the successor of a number. 


18.27 #DEF,CS,<#DEF,S, *S; X15; 
Program Number References Is referenced by

AGT 3.13 UPLO
AI 4.6 SEQ FRSORT
AOPA 4.4 SEQ
ARC 15.8 SQRT
DEXP

BNORM 10.1 REVERSE INORM 
LINE 

BREAKX 8.2 COUNT 
REPL 
IMAGE 
RCHAR 

BRKREM 8.1 DIFF 

BSORT 13.1 

CARDPAK 17.5 RPERMUTE POKEV 

ORDER POKER 

CATA 4.8 SEQ 

CEIL 15.5 DEXP 

CH 2.7 BASE10 

COMB 15.1 DECOMB
POKEV 

COPYL 5.8 

COUNT 3.4 BREAKX CRACK 
SPACING 
MINP 
FRSORT 

CRACK 4.1 COUNT FRSORT 

DAY 2.8 

DECOMB 15.2 COMB POKEV 

DEXP 14.1 CEIL
TRIG 
ARC 
LOG 
RAISE 
PHRASE 
POL 

DEXTERN 14.2 


Program 


FASTBAL 
FIND 

FLD 
FORTPUT 
FLOOR 
FORTREAD 
FPROFILE 


FRSORT 


FTRACE 


GPM


HEX 
HSORT 


HYPHENATE


IMAGE 


INFINIP 
Number
STRINGOUT
SPACING
BREAKX
SWAP
BLANKS


SNOREAD 


VISIT 


LINE 
Program


INSERT 
INSERTB 
IP


L_ONE 


LAST 


LEXGT 
LINE


LINEARIZE 


LOG 


LPAD 


Number 


13.8 


13.10 


14.4 


10.3 


13.9 
15.9 


3.2 


12.5 


11.5 


References 


REVERSE 


PAD 


SUBSTR 


MINP 
BNORM 


HYPHENATE 


DEXP 
MSORT


TR 


RAISE 


PUT 


ONEWAY 
FPROFILE
TPROFILE 
ONCE


ONEWAY 


OR 


ORDER 


PAD 


PARAGRAPH 
PEEL 

PERM 

PERMS 
PERMUTATION 


PHRASE 


PHYSICAL 
SPACING
IP


PUSH 
LPAD
BASEB 
DEXP
REDEFINE
ORDER 


Is 


referenced 


LINE 
REVERSE
COMB 
BASE10 
POL 18.4 DEXP
POP
GPM
Program Number
RANDOM 16.1 
RCHAR 16.5 
READ 9.1 
READL 5.1
READRL 5.2
REDEFINE 14.5 
REORDER 12.4 
REPL 3.15 
RESOLUTION 11.1 
REVERSE 3.6 
REVL 5.3 
ROMAN 2.3 
ROTATER 325 
RPAD 3.3 
RPERMUTE 16.3 


References


RANDOM 
BREAKX 


BREAKX 


RANDOM 


Is 
referenced 


RAMM 
RPERMUTE 
RCHAR 
RSELECT 
RSEASON 


RWORD 


FORTREAD 
PARAGRAPH 
SNOREAD 
TREEREAD 


INFINIP 
PHYSICAL 


QUOTE 
STACK 


TIMER 
BALREV
BNORM 
LINE 
PAD 
POKEV


POKEV 


PUT 
ASM 


ONEWAY 
CARDPAK 
RSTORY
SCAN


SEQ 
SNOPUT


SNOREAD 


SPACING 


SPELL 


SQRT 


SSORT 
STACK 


STATEF 


Number 


16.8 


References 


BAL 
RSELECT 


RSENTENCE 
RCHAR 


PUSH 
POP 


DIFF 


PUT 
PEEL 


READ 
FASTBAL 


COUNT 